Abstract

Identifying the infection sources in a network, including the index cases that introduce a contagious disease
into a population network, the servers that inject a computer virus into a computer network, or the individuals who
started a rumor in a social network, plays a critical role in limiting the damage caused by the infection through
timely quarantine of the sources. We consider the problem of estimating the infection sources and the infection
regions (subsets of nodes infected by each source) in a network, based only on knowledge of which nodes are
infected and their connections, and when the number of sources is unknown a priori. We derive estimators for the
infection sources and their infection regions based on approximations of the infection sequences count. We prove
that if there are at most two infection sources in a geometric tree, our estimator identifies the true source or sources
with probability going to one as the number of infected nodes increases. When there are more than two infection
sources, and when the maximum possible number of infection sources is known, we propose an algorithm with
quadratic complexity to estimate the actual number and identities of the infection sources. Simulations on various
kinds of networks, including tree networks, small-world networks and real world power grid networks, and tests on
two real data sets are provided to verify the performance of our estimators.

Index Terms 

Source estimation, infection graphs, inference algorithms, security, sensor networks, social networks. 

I. Introduction 

With rapid urbanization and advancements in transportation technologies, the world has become more
interconnected. A contagious disease like Severe Acute Respiratory Syndrome (SARS) can spread quickly through a
population and lead to an epidemic [1]. It is crucial to quickly identify the index cases of a contagious disease
since it allows us to study the causes, and hence facilitate the search for antiviral drugs and efficacious therapies.
Moreover, by inferring the set of individuals infected by each source, potential containment policies can be
formulated to prevent further spreading of the disease due to new index cases [2], [3]. In a similar vein, a computer
virus on a few servers of a computer network can quickly spread to other servers or computers in the network.
Without prompt identification and isolation of the source servers, significant damage can result [4], [5]. Identifying
the servers in the network that are first infected also allows us to detect the latent points of weakness in the
computer network so that preventive measures can be taken to enhance the protection at these points. The source
identification problem also arises in the study of rumor spreading in a social network. A rumor started by a few
individuals can spread quickly through the underlying social network [6]-[9]. In many cases, we are interested
in finding the sources of the rumor. For example, law enforcement agencies may be interested in identifying the
perpetrators who fabricate false information to manipulate the market prices of certain stocks.

This research was supported by the MOE AcRF Tier 1 Grant M52040000. W. Luo, W. P. Tay and M. Leng are with the Nanyang
Technological University, Singapore. E-mail: wluo1@e.ntu.edu.sg, {wptay, lengmei}@ntu.edu.sg

We can model all the above examples as an infection spreading in a network of nodes. In a population network, 
the infection is the disease that is transmitted between individuals. In the example of a computer virus spreading 
in a network, the infection is the computer virus, while for the case of a rumor spreading in a social network, 
the infection is the rumor. We consider the problem of estimating the infection sources in a network of infected 
nodes. We are interested in the scenario where the only given information is the set of infected nodes and their 
connections. This is because typically, complete data about the infection spreading process, like the first times 
when the infection is detected at each node, is not available. Even when such detection times are available, the 
naive method of declaring the first detected node in the network as the sole infection source is often incorrect, as
the infection may have a random dormant period, the length of which varies from node to node. For example, the 
spreading of a disease in a population with individuals having varying degrees of resistance, and hence exhibiting 
symptoms not necessarily in the order in which they are infected, presents such a problem. Our goal is to construct 
estimators for both the infection sources and their infection regions, i.e., the subset of nodes likely to be infected 
by each source, when the number and locations of the sources are unknown a priori. 

A. Related Works 

Existing works related to infection spreading in a network have primarily focused on the parameters of the 
diffusion process such as the outbreak thresholds and the effect of network structures [10]-[13]. Little work has
been done on identifying the infection sources. Our aim is to identify a set of nodes most likely to be the infection 
sources after the infection has spread for some time. This formulation is of interest in various practical scenarios, 
including the spreading of a new disease in a population network. By identifying the initial infectious sources, we 
can focus scarce resources like DNA testing on a small select group of patients instead of on the whole population. 
Other examples include identifying the initial entry points of a computer virus into a computer network, and the 
initiators of a rumor in a social network. 

The case where there is a single infection source has been studied in [14]. Based only on the knowledge of
which nodes are infected and the underlying network structure, an estimator based on the linear extensions count
of a poset or the number of infection sequences (cf. Section II) was derived to identify the most likely infection
source. It was shown in [14] that finding a single infection source is a #P-complete problem even in the case where
the infection is relatively simple, with infection from an infected node being equally likely to be transmitted to
any of its neighbors at each time step. This simple infection model is based on the classical susceptible-infected
(SI) model [15], which has been widely used in modeling viral epidemics [16]-[21]. An algorithm for evaluating
the single source estimator was proposed in [14], and it was shown to have complexity¹ $O(n)$ for tree networks,
where $n$ is the total number of infected nodes. Furthermore, it was shown that this estimator performs well in a
very general class of tree networks known as the geometric trees (cf. Section III-D), and identifies the infection
source with probability going to one as $n$ increases.

In many applications, there may be more than one infection source in the network. For example, an infectious
disease may be brought into a country through multiple individuals. Multiple individuals may collude in spreading
a rumor or malicious piece of information in a social network. In this paper, we investigate the case where there
may be multiple infection sources, and when the number of infection sources is unknown a priori. We also consider
the problem of estimating the infection region of each source, and show that a direct application of the algorithm
in [14] performs significantly worse than our proposed algorithms if there is more than one infection source.
We also note that [14] provides theoretical performance measures for several classes of tree networks, which we
are unable to do here except for the class of geometric trees, because of the greater complexity of our proposed
algorithms. Instead, we provide simulation results to verify the performance of our algorithms.

A related problem is the detection and localization of diffusive sources using wireless sensor networks [22]-[27].
The diffusion models used under this framework are based on spatio-temporal diffusion models [22] or state-space
models with linear dynamics [23], where information like the physical positions of sensors is known. There is
no natural translation of the source detection and localization problem in a sensor network to other networks like
a computer network, without performing discretization and introducing a combinatorial aspect to the problem, as
is done in [28] and [29]. Similarly, inference of viral epidemic processes in populations has been studied in [10],
[12], [15], where various features related to the propagation of a viral epidemic, such as the rates of infection
and the length of latency periods, are investigated. These works focus on specific viral infection processes with
assumptions that do not naturally hold for infection processes in other networks. Moreover, there is little work on
determining the sources or index cases of a disease.

On the other hand, the infection source estimation algorithms we consider in this paper can be useful in 
applications like pollution source localization, where we are limited to inexpensive sensors capable only of detecting 
the presence or absence of a pollutant, and the identities of its neighbors. In this case, spatio-temporal diffusion 
models are not applicable as we only have knowledge of which nodes are "infected". The algorithms we study in 
this paper are also applicable to inferring infection sources in viral epidemics, when little information about the 
epidemic propagation characteristics is available. 

¹A function $f(n) = O(g(n))$ if $f(n) \le c\,g(n)$ for some constant $c$ and for all $n$ sufficiently large.

B. Our Contributions

In this paper, we consider the estimation of multiple infection sources when the number of infection sources is 
unknown a priori. We adopt the same SI diffusion model as in [14], as this has been widely used to model various
infection spreading processes [16]-[21]. The results of this work are applicable to scenarios where the infection
spreads in an approximately homogeneous way, with infections happening independently. Examples include the 
spreading of a new disease in a human population, where nobody has yet developed any immunity to the disease. 
A novel computer virus attacking a network can also be modeled using a homogeneous spreading process. On the 
other hand, our model is highly simplistic and does not model many other spreading processes of practical interest. 
However, as alluded to earlier, finding the infection sources in this simple model is already very challenging. 
The focus of our work is not on modeling infection processes. Rather, by restricting our analysis to the simplest 
homogeneous exponential spreading model, we hope to gain insights into identifying multiple infection sources in 
real networks. We show that unlike the single source estimation problem, the multiple source estimation problem is 
much more complex and cannot be solved exactly even for regular trees. Our main contributions are the following. 

(i) For the case of a tree network, and when it is known that there are two infection sources, we derive an
estimator for the infection sources based on the infection sequences count. The estimator can be calculated
in $O(n^2)$ time complexity, where $n$ is the number of infected nodes.

(ii) When there are at most two infection sources that are at least two hops apart, we derive an estimator for
the class of geometric trees based on approximations of the estimator in (i), and we show that our estimator
correctly estimates the number of infection sources and correctly identifies the source nodes, with probability
going to one as the number of infected nodes increases.

(iii) We derive an estimator for the infection regions of every infection source under a simplifying technical
condition.

(iv) For general graphs, when there are at most $k_{\max}$ infection sources, we provide an estimation procedure for
the infection sources and infection regions. Simulations suggest that on average, our estimators are within a
few hops of the true infection sources in the infection graphs.²

(v) We test our estimators on real data in Section V-C. The first test is based on real contact tracing data of a
patient cluster during the SARS outbreak in Singapore in 2003. Our estimator correctly identifies the number
of index cases for the cluster to be one and successfully finds this index case. The second test considers the
Arizona-Southern California cascading power outages in 2011. Our estimator correctly identifies the number
of outage sources for the main affected power network to be two, and our estimates are within 1 hop of the
real sources. These tests suggest that our estimator has reasonable performance in some applications even
though we have adopted a simplistic infection model.

The rest of the paper is organized as follows. In Section II, we present the system model and problem formulation.
In Section III, we derive estimators for infection sources and regions for tree networks, and present algorithms to
evaluate them. We also show asymptotic results for geometric tree networks. We discuss estimation algorithms for
general graphs in Section IV. In Section V, we present simulations and tests on real data to verify the performance
of our proposed estimators. Finally, we conclude and summarize in Section VI.

²In general, we do not know the whole underlying network, but rather the subgraph of infected nodes. For example, in the case of a
contagious disease spreading in a population, we only perform contact tracing on the patients to construct the connections among them. From
our simulation studies, the infection graph typically has an average diameter of more than 27 hops even though the underlying network's
diameter is much smaller.

II. Problem Formulation 

In this section, we describe our model and assumptions, introduce some notations, and present some preliminary
results. Consider an undirected graph $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. If
there is an edge connecting two nodes, we say that they are neighbors. The neighborhood $\mathcal{N}_G(v)$ of a node $v$ is
the set of all neighbors of $v$ in $G$. The length of the shortest path between $u$ and $v$ is denoted as $d(u, v)$. In a
computer network, the graph $G$ models the interconnections between computers in the network. In the example of
a population or a social network, $V$ is the set of individuals, while an edge in $E$ represents a relationship between
two individuals. We define an infection to be a property that a node in $G$ possesses, and that can be transmitted to
a neighboring node. When a node has an infection, we say that it is infected. The neighbors of an infected node
are said to be susceptible. We assume the susceptible-infected model [15], where once a node has been infected, it
will not lose its infection. We adopt the same infection spreading process as in [14], where the time taken for an
infected node to infect a susceptible neighbor is exponentially distributed with rate 1. All infections are independent
of each other. Therefore, if a susceptible node has more than one infected neighbor and subsequently becomes
infected, its infection is transmitted by one of its infected neighbors, chosen uniformly at random. For mathematical
convenience, we also assume that $G$ is large so that boundary effects can be ignored in our analysis.
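As an illustration of the spreading process described above, the following Python sketch simulates the SI model by
drawing an independent rate-1 exponential transmission time on every edge from a newly infected node to a
not-yet-infected neighbor, and infecting nodes in order of arrival time. The function name `simulate_si`, the
adjacency-dictionary representation of $G$, and the `seed` parameter are our own illustrative assumptions and do not
appear in the model description.

```python
import heapq
import random


def simulate_si(adj, sources, n_infected, seed=0):
    """Simulate SI spreading with i.i.d. Exp(1) transmission times on edges.

    adj        : dict mapping each node to a list of its neighbors (undirected graph)
    sources    : list of nodes infected at time 0
    n_infected : stop once this many nodes (sources included) are infected
    Returns (order, infector): the nodes in order of infection, and a dict
    mapping each infected node to the neighbor that infected it (None for sources).
    """
    rng = random.Random(seed)
    infector = {s: None for s in sources}
    order = list(sources)
    heap = []  # entries are (arrival time, infecting node, target node)
    for s in sources:
        for v in adj[s]:
            if v not in infector:
                heapq.heappush(heap, (rng.expovariate(1.0), s, v))
    while heap and len(order) < n_infected:
        t, u, v = heapq.heappop(heap)
        if v in infector:
            continue  # v was already infected through another edge
        infector[v] = u
        order.append(v)
        for w in adj[v]:
            if w not in infector:
                heapq.heappush(heap, (t + rng.expovariate(1.0), v, w))
    return order, infector


if __name__ == "__main__":
    # Hypothetical example: a path graph 0 - 1 - ... - 9 with a single source at node 0.
    path = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 9] for i in range(10)}
    order, infector = simulate_si(path, [0], n_infected=5)
    print(order)
```

The infected nodes returned in `order`, together with the edges of $G$ among them, form one realization of the
infection graph; the infection times themselves are discarded, mirroring the setting in which only the set of
infected nodes and their connections is observed.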

Suppose that at time 0, there are $k \ge 1$ nodes in the infected node set $S^* = \{s_1, \ldots, s_k\} \subset V$. These are the
infection sources from which all other nodes get infected. Suppose that after the infection process has run for some
time, $n$ nodes are observed to be infected. Typically, $n$ is much larger than $k$. These nodes form an infection
graph $G_n = (V_n, E_n)$, which is a subgraph of $G$. Let $A_n^* = \bigcup_{i=1}^{k} A_{n,i}$ be a partition of the infected nodes $V_n$ so
that $A_{n,i} \cap A_{n,j} = \emptyset$ for $i \ne j$, with each partition $A_{n,i}$ being connected in $G_n$, and consisting of the nodes whose
infection can be traced back to the source node $s_i$. The set $A_{n,i}$ is called the infection region of $s_i$, and we say
that $A_n^*$ is the infection partition. Given $G_n$, our objective is to infer the sources of infection $S^*$ and to estimate
$A_n^*$. In addition, if we do not have prior knowledge of the number of infection sources $k$, we also aim to infer
the number of infection sources. Without loss of generality, we assume that $G_n$ is connected, otherwise the same
estimation procedure can be performed on each of the components of the graph. We also assume that there are
at most $k_{\max}$ infection sources, i.e., the number of infection sources $k \le k_{\max}$. From a practical point of view, if
two infection sources are close to each other, we can ignore either one of them and treat the infection as spreading
from a single source. Therefore, we are interested in cases where the infection sources are separated by a minimum
distance. These assumptions are summarized in the following.

Assumption 1. The number of infection sources is at most $k_{\max}$, and the infection graph $G_n$ is connected.

Assumption 2. For all distinct $s_i, s_j \in S^*$, the length of the shortest path between them $d(s_i, s_j) > r$, where $r$ is a
constant greater than 1.

Assumption 3. Every node in $G$ has bounded degree, with $d_{\max}$ being the maximum node degree.

Suppose that our priors for $S^*$ and $A_n^*$ are uniform over all possible realizations, and let $P$ be the probability
measure of the infection process. We seek $S$ and $A_n$ that maximize the posterior probability of $S^*$ and $A_n^*$ given
$G_n$,

$$P(S^* = S, A_n^* = A_n \mid G_n) \propto P(G_n \mid S)\, P(A_n \mid S, G_n), \tag{1}$$

where $P(G_n \mid S)$ is the probability of observing $G_n$ if $S$ is the set of infection sources, and $P(A_n \mid S, G_n)$ is the
probability that $A_n^* = A_n$ conditioned on $S$ being the infection source set and the infection graph being $G_n$.

For any source set $S$, let an infection sequence $\sigma = (\sigma_1, \ldots, \sigma_{n-k})$ be a sequence of the nodes in $G_n$, excluding
the $k$ source nodes in $S$, arranged in ascending order of their infection times (note that with probability one,
no two infection times are the same). For any sequence to be an infection sequence, a necessary and sufficient
condition is that any infected node $\sigma_i$, $i = 1, \ldots, n-k$, has a neighbor in $S \cup \{\sigma_1, \ldots, \sigma_{i-1}\}$. We call this the
infection sequence property. An example is shown in Figure 1. Let $\Omega(G_n, S)$ be the set of infection sequences for
an infection graph $G_n$ and source set $S$, and let $C(S \mid G_n) = |\Omega(G_n, S)|$ be the number of infection sequences.
We have

$$P(G_n \mid S) = \sum_{\sigma \in \Omega(G_n, S)} P(\sigma \mid S), \tag{2}$$

where $P(\sigma \mid S)$ is the probability of obtaining the infection sequence $\sigma$ conditioned on $S$ being the infection
sources.
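The infection sequence property and the probability $P(\sigma \mid S)$ of a single sequence can be checked numerically on
small examples. The Python sketch below is a minimal illustration under two assumptions of ours: the function names
are hypothetical, and the example graph is only one graph consistent with the edge counts quoted in the caption of
Fig. 1 (the edges incident to $u_1$ and $u_4$ are assumed, since the figure itself is not reproduced here). At each
step, the next infection is equally likely to occur along any edge joining an infected node to an uninfected one,
which follows from the memoryless property of the exponential transmission times.

```python
from fractions import Fraction


def is_infection_sequence(adj, sources, seq):
    """Check that every node in seq has a neighbor among the sources or earlier nodes."""
    infected = set(sources)
    for v in seq:
        if v in infected or not any(u in infected for u in adj[v]):
            return False
        infected.add(v)
    return True


def sequence_probability(adj, sources, seq):
    """Probability that the SI process infects the nodes of seq in exactly this order."""
    infected = set(sources)
    prob = Fraction(1)
    for v in seq:
        # All edges from an infected node to a currently uninfected node.
        boundary = [(u, w) for u in infected for w in adj[u] if w not in infected]
        hits = sum(1 for (_, w) in boundary if w == v)
        if hits == 0:
            return Fraction(0)
        prob *= Fraction(hits, len(boundary))
        infected.add(v)
    return prob


# Assumed example graph, consistent with the counts in the caption of Fig. 1.
ADJ = {
    "s1": ["u2", "u3"],
    "s2": ["u2", "u5"],
    "u1": ["u2"],
    "u2": ["s1", "s2", "u1", "u4"],
    "u3": ["s1"],
    "u4": ["u2"],
    "u5": ["s2"],
}

if __name__ == "__main__":
    print(is_infection_sequence(ADJ, ["s1", "s2"], ["u2", "u1"]))
    print(sequence_probability(ADJ, ["s1", "s2"], ["u2", "u1"]))
```

Summing `sequence_probability` over all sequences that satisfy `is_infection_sequence` gives $P(G_n \mid S)$ as in (2),
although this brute-force enumeration is feasible only for very small infection graphs.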

Evaluating the expression (2) and maximizing (1) for a general $G_n$ is a computationally hard problem as it
involves combinatorial quantities. As shown in [14], if $G$ is a regular tree and $|S| = 1$, $P(G_n \mid S)$ is proportional
to $|\Omega(G_n, S)|$, which is equivalent to the number of linear extensions of a poset. It is known that evaluating the
linear extensions count is a #P-complete problem [30]. As such, we will make a series of approximations to simplify
the problem, and present numerical results in Section V to verify our algorithms. The first approximation we make
is to evaluate the estimators

$$\hat{S} = \arg\max_{S \subset V_n,\ |S| \le k_{\max}} P(G_n \mid S), \tag{3}$$

$$\hat{A}_n(\hat{S}) = \arg\max_{A_n} P(A_n \mid \hat{S}, G_n), \tag{4}$$

instead of the exact maximum a posteriori (MAP) estimators for (1). Even with this approximation, the optimal
estimators are difficult to compute exactly, and may not be unique in general. Therefore, our goal is to design
algorithms that are approximately optimal but computationally efficient. In Section III, we make further approximations
and design algorithms to evaluate the estimators $\hat{S}$ and $\hat{A}_n(\hat{S})$ when $G$ is a tree. In Section IV, we consider the
case when $G$ is a general graph. For the reader's convenience, we summarize some notations commonly used in
this paper in Table I. Several notations have been introduced previously, while we formally define the remaining
ones in the sequel where they first appear.

Fig. 1. Example of an infection sequence. The shaded nodes are the infected nodes which form the infection graph $G_n$. Infection sources
are $S = \{s_1, s_2\}$. The sequence $(u_2, u_1)$ is an infection sequence, but $(u_1, u_2)$ is not. The probability of the infection sequence $\sigma = (u_2, u_1)$
is then given by $P(\sigma \mid S) = \frac{1}{2} \times \frac{1}{4} = \frac{1}{8}$. The first fraction $\frac{1}{2}$ is obtained by observing that when only $s_1$ and $s_2$ are infected, there are four
edges $(s_1, u_2)$, $(s_1, u_3)$, $(s_2, u_2)$, and $(s_2, u_5)$ for the infection to spread. The infection is equally likely to spread along any of these four
edges, out of which two result in the infection of node $u_2$. After $u_2$ is infected, there are 4 edges over which the infection can spread and
this corresponds to the fraction $\frac{1}{4}$.

III. Identifying Infection Sources and Regions for Trees 

In this section, we consider the problem of estimating the infection sources and regions when the underlying
network $G$ is a tree. We first derive an estimator for the infection partition in (4), given any source node set $S$ and
$G_n$. Then, we derive an estimator based on the number of infection sequences. Next, we consider the case where
there are two infection sources, propose approximations that allow us to compute the estimator with reasonable
complexity, and show that our proposed estimator works well in an asymptotically large geometric tree under some
simplifying assumptions. In most practical applications, the number of infection sources is not known a priori.
We present a heuristic algorithm for general trees to estimate the infection sources when the number of infection
sources is unknown, but bounded by $k_{\max}$.

A. Infection Partition with Multiple Sources 

In this section, we derive an approximate infection partition estimator given any infection source set $S$.
This estimator is exact under a simplifying technical condition given in Theorem 1 below, the proof of which is
provided in Appendix A.

TABLE I
Summary of notations used.

Symbol                  Definition
$G$                     underlying network
$d(u, v)$               length of the shortest path between $u$ and $v$
$\mathcal{N}_G(u)$      set of neighbors of $u$ in $G$
$d_G(u)$                number of neighbors of node $u$ in $G$
$G_n$                   infection graph with $n$ infected nodes
$S^*$                   infection sources
$A_n^*$                 infection partition of an infection graph $G_n$
$A_{n,i}$               infection region of an infection source $s_i$
$\Omega(G_n, S)$        set of infection sequences for an infection graph $G_n$ and source set $S$
$C(S \mid G_n)$         $= |\Omega(G_n, S)|$

Symbol                  Definition (defined implicitly w.r.t. $G_n$)
$\rho(u, v)$            path between $u$ and $v$ in the infection graph $G_n$
$T_v(S)$                tree in $G_n$, rooted at $v$ w.r.t. source set $S$
$T_M(S)$                $= \bigcup_{v \in M} T_v(S)$, where $M$ is a subset of nodes
$I_i(\xi; S)$           $= \sum_{j \le i} |T_{\xi_j}(S)|$, where $\xi$ is a sequence of nodes
$A_i(s_1, s_2)$         total number of nodes in the $i$ biggest trees in $\{T_u(s_1, s_2) : u \in \rho(s_1, s_2)\}$

Theorem 1. Suppose that $G$ is a tree with infection sources $S$, and $H_n$ is the subgraph of $G_n$ consisting of all
paths between any pair of nodes in $S$. If any two paths in $H_n$ do not intersect except possibly at nodes in $S$, then
the optimal estimator $\hat{A}_n(S)$ for the infection partition is a Voronoi partition of the graph $G_n$, where the centers
of the partitions are the infection sources $S$.

A Voronoi partition may not produce the optimal estimator for the infection partition in a general infection graph.
However, it is intuitively appealing as nodes closer to a particular source are more likely to be infected by that
source. For simplicity, we will henceforth use the Voronoi partition of the infection graph $G_n$ as an estimator for
$A_n^*$, and present simulation results in Section V to verify its performance. We will also see in Section III-E that
this approximation allows us to design an infection source estimation algorithm with low complexity.
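A Voronoi partition of $G_n$ around a given source set can be obtained with a breadth-first search started from all
sources simultaneously, so that each infected node is assigned to a source at minimum hop distance. The Python sketch
below illustrates this; the function name `voronoi_partition` and the rule that ties are broken in favor of the source
whose search front reaches the node first are our own assumptions, since the text does not specify a tie-breaking rule.

```python
from collections import deque


def voronoi_partition(adj, sources):
    """Assign every node of a connected graph to a nearest source (in hops).

    adj     : dict mapping each node of the infection graph to a list of neighbors
    sources : list of source nodes (the centers of the partition)
    Returns a dict mapping each source to the set of nodes in its region.
    """
    owner = {s: s for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in owner:
                owner[v] = owner[u]  # v is first reached from u's region
                queue.append(v)
    regions = {s: set() for s in sources}
    for v, s in owner.items():
        regions[s].add(v)
    return regions


if __name__ == "__main__":
    # Hypothetical example: a path 0 - 1 - 2 - 3 - 4 - 5 with sources at both ends.
    path = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 5] for i in range(6)}
    print(voronoi_partition(path, [0, 5]))
```

Each region produced this way is connected in $G_n$, because every node is attached to its region through the
neighbor from which it was first reached.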

B. Estimation of Infection Sources 

We now consider the problem of estimating the set of infection sources $S^*$. When $|S^*| = 1$, our estimation
problem reduces to that in [14], which considers only the single source infection problem. In the following, we
introduce some notations, and briefly review some relevant results from [14].

A path between any two nodes $u$ and $v$ in the tree $G_n$ is denoted as $\rho(u, v)$. For any set of nodes $S$ in $G_n$,
consider the connected subgraph $H_n \subset G_n$ consisting of all paths between any pair of nodes in $S$. Treat this
subgraph as a "super" node, with the tree $G_n$ rooted at this "super" node. For any node $v \in G_n \setminus H_n$, we define
$T_v(S)$ to be the tree rooted at $v$ with the path from $v$ to $H_n$ removed. For $v \in H_n$, we define $T_v(S)$ to be the tree
rooted at $v$ so that all edges between $v$ and its neighbors in $H_n$ are removed.³ We say that $T_v(S)$ is the tree rooted
at $v$ with respect to (w.r.t.) $S$. For any subset of nodes $M \subset G_n$, we let $T_M(S) = \bigcup_{v \in M} T_v(S)$. An illustration
of these definitions is shown in Figure 2. If $S = \{s_1, \ldots, s_k\}$, we will sometimes use the notation $T_v(s_1, \ldots, s_k)$
instead.

Fig. 2. A sample infection graph with $S = \{s_1, s_2\}$.

Recall that $C(S \mid G_n)$ is the number of infection sequences if $S$ is the infection source set. If there is a single
infection source node $S = \{s\}$, and $G$ is a regular tree where each node has the same degree, it is shown in [14]
that the MAP estimator for the infection source is obtained by evaluating $\hat{S} = \arg\max_{v \in G_n} C(v \mid G_n)$, which
seeks to maximize $C(v \mid G_n)$ over all nodes. Therefore, it has been suggested that $C(v \mid G_n)$ can be used as the
infection source estimator for general trees. The following result is provided in [14].

Lemma 1. Suppose that $G_n$ is a tree. For any node $s \in G_n$, we have

$$C(s \mid G_n) = n! \prod_{u \in G_n} |T_u(s)|^{-1}. \tag{5}$$

We observe that each term $|T_u(s)|$ in the product on the right hand side (R.H.S.) of (5) is the number of nodes
in the sub-tree $T_u(s)$ (and which appears when we account for the number of permutations of these nodes). We can
think of the terms in the product being ordered according to the infection spreading sequence, i.e., each time we
reach a particular node $u$, we include terms corresponding to the nodes $u$ can potentially infect. This interpretation
is useful in helping us understand the characterization in Lemma 2 for the case where there are two infection
sources.
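The counting formula of Lemma 1 can be evaluated directly from the sub-tree sizes and compared against a brute-force
enumeration of infection sequences on a small tree. The Python sketch below does this; the function names and the
adjacency-dictionary representation are our own assumptions, and the routine is a plain illustration of (5) rather
than the SSE algorithm of [14].

```python
from itertools import permutations
from math import factorial, prod


def subtree_sizes(adj, root):
    """Return |T_u(root)| for every node u of the tree, rooted at `root`."""
    parent = {root: None}
    order = [root]
    for u in order:  # breadth-first order; the list grows while we iterate
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                order.append(v)
    size = {u: 1 for u in order}
    for u in reversed(order):  # children are processed before their parents
        if parent[u] is not None:
            size[parent[u]] += size[u]
    return size


def count_sequences(adj, s):
    """Number of infection sequences C(s | G_n) = n! / prod_u |T_u(s)|."""
    size = subtree_sizes(adj, s)
    return factorial(len(size)) // prod(size.values())


def brute_force_count(adj, s):
    """Count orderings of the non-source nodes that satisfy the infection sequence property."""
    others = [v for v in adj if v != s]
    count = 0
    for seq in permutations(others):
        infected = {s}
        valid = True
        for v in seq:
            if not any(u in infected for u in adj[v]):
                valid = False
                break
            infected.add(v)
        count += valid
    return count


if __name__ == "__main__":
    # Hypothetical example: a star with center 0 and leaves 1, 2, 3.
    star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
    for s in star:
        print(s, count_sequences(star, s), brute_force_count(star, s))
```

On the star, the center admits $3! = 6$ infection sequences whereas each leaf admits only 2, so maximizing
$C(v \mid G_n)$ selects the center.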

To compute $C(v \mid G_n)$, an $O(n)$ algorithm based on Lemma 1 was provided in [14]. We call this algorithm the
Single Source Estimation (SSE) algorithm. We refer the reader to [14] for details about the implementation of the
algorithm. Although finding $\hat{S}$ by maximizing $C(s \mid G_n)$ is exact only for regular trees, it was shown in [14] that
this estimator has good performance for other classes of trees. In particular, if $G$ is a geometric tree (cf. Section
III-D), then the probability, conditioned on $S^* = \{s\}$, of correctly identifying $s$ using $C(s \mid G_n)$ goes to one as
$n \to \infty$. Inspired by this result, we propose estimators based on quantities related to $C(S \mid G_n)$ for cases where
$|S^*| > 1$. In the following, we first discuss the case where $|S^*| = 2$, and extend the results to the general case
where $|S^*|$ is unknown in Section III-E. We then numerically compare our proposed algorithms with a modified
SSE algorithm adapted for finding multiple sources in Section V.

³As $T_v(S)$ is defined on $G_n$, its notation should include $G_n$. However, in order to avoid cluttered expressions, we drop $G_n$ in our
notations. Confusion will be avoided through the context in which these trees are referenced.

Fig. 3. A sample infection graph with $S = \{s_1, s_2\}$. Given an infection sequence $\sigma = (u_3, u_1, u_2) \in \Omega(\rho(s_1, s_2), \{s_1, s_2\})$, we
can find the corresponding reverse infection sequence $\xi = (u_2, u_1, u_3)$. We have $I_1(\xi; s_1, s_2) = |T_{u_2}(s_1, s_2)| = 1$, $I_2(\xi; s_1, s_2) =
|T_{u_2}(s_1, s_2)| + |T_{u_1}(s_1, s_2)| = 4$, $I_3(\xi; s_1, s_2) = |T_{u_2}(s_1, s_2)| + |T_{u_1}(s_1, s_2)| + |T_{u_3}(s_1, s_2)| = 6$.

C. Two Infection Sources 

In this section, we assume that there are two infection sources $S = \{s_1, s_2\}$. Given two nodes $u$ and $v$ in $G_n$,
suppose that $|\rho(u, v)| = m$. For any permutation $\xi = (\xi_1, \ldots, \xi_m)$ of the nodes in $\rho(u, v)$, let

$$I_i(\xi; s_1, s_2) = \sum_{j \le i} |T_{\xi_j}(s_1, s_2)| \tag{6}$$

be the total number of nodes in the trees rooted at the first $i$ nodes in the permutation $\xi$. We have the following
characterization for $C(s_1, s_2 \mid G_n)$, whose proof is given in Appendix B.

Lemma 2. Suppose that $G_n$ is a tree. Consider any two nodes $s_1$ and $s_2$ in $G_n$, and suppose that $\rho(s_1, s_2) =
(s_1, u_1, \ldots, u_m, s_2)$. We have

$$C(s_1, s_2 \mid G_n) = (n-2)! \cdot q(u_1, u_m; s_1, s_2) \cdot \prod_{u \in G_n \setminus \rho(s_1, s_2)} |T_u(s_1, s_2)|^{-1}, \tag{7}$$

where for $1 \le i \le j \le m$, $q(u_i, u_j; s_1, s_2)$ satisfies the following recursive relationship

$$q(u_i, u_j; s_1, s_2) = |T_{\rho(u_i, u_j)}(s_1, s_2)|^{-1} \left( q(u_{i+1}, u_j; s_1, s_2) + q(u_i, u_{j-1}; s_1, s_2) \right) \quad \text{for } i < j, \tag{8}$$

with $q(v, v; s_1, s_2) = |T_v(s_1, s_2)|^{-1}$ for all $v \in \rho(u_1, u_m)$. Furthermore, we have

$$q(u_1, u_m; s_1, s_2) = \sum_{\xi \in \Gamma(u_1, u_m)} \prod_{i=1}^{m} I_i(\xi; s_1, s_2)^{-1}, \tag{9}$$

and $\Gamma(u_1, u_m)$ is the set of all permutations $\xi = (\xi_1, \ldots, \xi_m)$ of nodes in $\rho(u_1, u_m)$ such that $(\xi_m, \ldots, \xi_1)$ is an
infection sequence starting from $s_1$ and $s_2$ and resulting in $\rho(s_1, s_2)$.

The characterization for $C(s_1, s_2 \mid G_n)$ is similar to that for the single source case in (5), except for the additional
$q(u_1, u_m; s_1, s_2)$ term. We first clarify the meaning of $\Gamma(u_1, u_m)$. Given any infection sequence $\sigma$ that starts with
$\{s_1, s_2\}$ and results in $\rho(s_1, s_2)$, i.e., $\sigma = (\sigma_1, \ldots, \sigma_m) \in \Omega(\rho(s_1, s_2), \{s_1, s_2\})$, we can find a permutation $\xi =
(\xi_1, \ldots, \xi_m)$ of nodes in $\rho(u_1, u_m)$ such that $\xi_i = \sigma_{m-i+1}$ for $i = 1, \ldots, m$. In other words, $\xi$ can be interpreted as
the reverse infection sequence corresponding to $\sigma$. Then $\Gamma(u_1, u_m)$ is the set of all such reverse infection sequences
corresponding to $\Omega(\rho(s_1, s_2), \{s_1, s_2\})$. We show an illustration of these definitions in Figure 3. Each term $|T_u(s)|$
in the product in the R.H.S. of (5) can be interpreted as the number of nodes that can be infected via $u$ once
$u$ has been infected. Similarly, the sum in (9) is over all possible reverse infection sequences $\xi$ of the nodes in
$\rho(u_1, u_m)$, and each term $I_i(\xi; s_1, s_2)$ in the product within the sum is the number of nodes that can be infected
once $\xi_{i+1}, \ldots, \xi_m$ have been infected.
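The two-source count of Lemma 2 can be evaluated on a small tree by combining the recursion (8) for $q$ with the
sub-tree sizes w.r.t. $\{s_1, s_2\}$. The Python sketch below is a direct, unoptimized illustration: the function names
are hypothetical, the product in (7) is taken over the nodes that do not lie on $\rho(s_1, s_2)$, and exact rational
arithmetic is used so that the result is an integer. It is not the TSE algorithm itself, which reuses intermediate
values across source pairs.

```python
from fractions import Fraction
from functools import lru_cache
from math import factorial


def rooted_sizes(adj, roots):
    """Sub-tree sizes |T_v(S)| when the tree is rooted at the connected node set `roots`."""
    parent = {r: None for r in roots}
    order = list(roots)
    for u in order:  # breadth-first search away from the root set
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                order.append(v)
    size = {u: 1 for u in order}
    for u in reversed(order):
        if parent[u] is not None:
            size[parent[u]] += size[u]
    return size


def tree_path(adj, s1, s2):
    """Unique path (s1, u_1, ..., u_m, s2) between two nodes of a tree."""
    parent = {s1: None}
    order = [s1]
    for u in order:
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                order.append(v)
    path = [s2]
    while path[-1] != s1:
        path.append(parent[path[-1]])
    return path[::-1]


def two_source_count(adj, s1, s2):
    """Number of infection sequences C(s1, s2 | G_n) via the recursion for q."""
    path = tree_path(adj, s1, s2)
    inner = path[1:-1]  # the nodes u_1, ..., u_m strictly between the sources
    if not inner:
        raise ValueError("the sources must be at least two hops apart")
    size = rooted_sizes(adj, path)
    t = [size[u] for u in inner]

    @lru_cache(maxsize=None)
    def q(i, j):
        segment = sum(t[i:j + 1])  # |T_{rho(u_i, u_j)}(s1, s2)|
        if i == j:
            return Fraction(1, segment)
        return (q(i + 1, j) + q(i, j - 1)) / segment

    on_path = set(path)
    off_path = Fraction(1)
    for u, sz in size.items():
        if u not in on_path:
            off_path /= sz
    return int(factorial(len(size) - 2) * q(0, len(inner) - 1) * off_path)


if __name__ == "__main__":
    # Hypothetical tree: path 0 - 1 - 2 - 3 with extra branches 0 - 4, 1 - 5 - 6 and 3 - 7.
    tree = {0: [1, 4], 1: [0, 2, 5], 2: [1, 3], 3: [2, 7],
            4: [0], 5: [1, 6], 6: [5], 7: [3]}
    print(two_source_count(tree, 0, 3))
```

In the example tree with sources 0 and 3, the only ordering constraint among the six non-source nodes is that node 1
precedes node 5, which precedes node 6, so the count is $6!/3! = 120$, in agreement with the value returned by the
recursion.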

By utilizing Lemma 2, we can compute $C(u, v \mid G_n)$ for any two nodes $u$ and $v$ in $G_n$ by evaluating $|T_w(u, v)|$
for all nodes $w \in G_n$, and the quantity $q(u_1, u_m; u, v)$, where $\rho(u, v) = (u, u_1, \ldots, u_m, v)$. With Assumption 3,
Algorithm 1 allows us to compute $f_w(u) = |T_w(u)|$ and $g_w(u) = \prod_{v \in T_w(u)} |T_v(u)|$ for all neighbors $u$ of $w$, and
for all $w \in G_n$ in $O(n)$ time complexity. To do this, we first choose any node $r \in G_n$, and consider $G_n$ as a
directed tree with $r$ as the root node, and with edges in $G_n$ pointing away from $r$. Let pa$(w)$ and ch$(w)$ be the
parent and the set of children of $w$ in the directed tree $G_n$, respectively. Starting from the leaf nodes, let each
non-root node $w \in G_n$ pass two messages containing $f_w(\mathrm{pa}(w))$ and $g_w(\mathrm{pa}(w))$ to its parent. Each node stores the
values of these two messages from each of its children, and computes its two messages to be passed to its parent.
When $r$ has received all messages from its children, a reverse sweep down the tree is done so that at the end of
the algorithm, every node $w \in G_n$ has stored the values $\{f_u(w), g_u(w) : u \in \mathcal{N}_{G_n}(w)\}$. The algorithm is formally
described in Algorithm 1. The last product term in the expression for $C(s_1, s_2 \mid G_n)$ in Lemma 2 can then be
computed using

$$g(s_1, s_2) = \prod_{w \in \rho(s_1, s_2)} \; \prod_{x \in \mathcal{N}_{G_n}(w) \setminus \rho(s_1, s_2)} g_x(w), \qquad (10)$$

and taking its reciprocal.
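The two sweeps described above can be sketched in Python as follows. This is a minimal illustration under our reading of the definitions $f_w(u) = |T_w(u)|$ and $g_w(u) = \prod_{v \in T_w(u)} |T_v(u)|$, not the authors' implementation. The function name `tree_sizes_and_products` and the adjacency-dictionary representation are assumptions made for the example, and the downward sweep uses an exact integer division in place of an explicit product over siblings.

```python
def tree_sizes_and_products(adj, root):
    """Two-sweep message passing on a tree given as {node: set_of_neighbours}.

    Returns dicts f and g keyed by ordered neighbour pairs (w, u):
      f[w, u] = |T_w(u)|, size of the subtree rooted at w pointing away from u
      g[w, u] = product of |T_v(u)| over all v in T_w(u)
    """
    n = len(adj)
    parent = {root: None}
    order = [root]
    for x in order:                      # BFS; the list grows while we scan it
        for y in adj[x]:
            if y not in parent:
                parent[y] = x
                order.append(y)

    f, g = {}, {}
    # Upward sweep: children report to parents, leaves first.
    for w in reversed(order):
        p = parent[w]
        if p is None:
            continue
        size, prod = 1, 1
        for x in adj[w]:
            if x != p:
                size += f[x, w]
                prod *= g[x, w]
        f[w, p] = size
        g[w, p] = size * prod
        f[p, w] = n - size               # the rest of the tree, seen from w

    # Downward sweep: parents report to children, root first.
    for w in order:
        total = 1
        for y in adj[w]:
            total *= g[y, w]             # includes the parent's message, if any
        for x in adj[w]:
            if x != parent[w]:
                # Product over all neighbours of w except x, times |T_w(x)|.
                g[w, x] = f[w, x] * (total // g[x, w])
    return f, g
```

Each node is touched a constant number of times per incident edge, so the number of arithmetic operations is linear in the number of nodes; the integers themselves can become large, which is why a practical implementation would likely work with logarithms.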

Algorithm 1 Tree Sizes and Products Computation
1: Inputs: $G_n$
2: Choose any node $r \in G_n$ as the root node.
3: for $w \in G_n$ do
4:     Store received messages $f_x(w)$ and $g_x(w)$, for each $x \in \mathrm{ch}(w)$.
5:     if $w$ is a leaf then
6:         $f_w(\mathrm{pa}(w)) = 1$
7:         $g_w(\mathrm{pa}(w)) = 1$
8:     else
9:         $f_w(\mathrm{pa}(w)) = 1 + \sum_{x \in \mathrm{ch}(w)} f_x(w)$
10:        $g_w(\mathrm{pa}(w)) = f_w(\mathrm{pa}(w)) \cdot \prod_{x \in \mathrm{ch}(w)} g_x(w)$
11:    end if
12:    Store $f_{\mathrm{pa}(w)}(w) = n - f_w(\mathrm{pa}(w))$.
13:    Pass $f_w(\mathrm{pa}(w))$ and $g_w(\mathrm{pa}(w))$ to pa$(w)$.
14: end for
15: Set $g_{\mathrm{pa}(r)}(r) = 1$.
16: for $w \in G_n$ do
17:    Store received message $g_{\mathrm{pa}(w)}(w)$ from pa$(w)$.
18:    if $w$ is not a leaf then
19:        for $x \in \mathrm{ch}(w)$ do
20:            $g_w(x) = f_w(x) \cdot g_{\mathrm{pa}(w)}(w) \cdot \prod_{y \in \mathrm{ch}(w) \setminus \{x\}} g_y(w)$
21:            Pass $g_w(x)$ to $x$.
22:        end for
23:    end if
24: end for

To compute $C(s_1, s_2 \mid G_n)$, we still need to compute $q(u_1, u_m; s_1, s_2)$. The recurrence (8) allows us to
compute $q(u_1, u_m; s_1, s_2)$ for all $s_1, s_2 \in G_n$ in $O(n^2 d_{\max}^2)$ complexity, where $d_{\max}$ is the maximum node degree. The
computation proceeds by first considering each pair of neighbors $(u, v)$. Both nodes have at most $d_{\max}$ neighbors
each, so that we need to evaluate $q(u, v; s_1, s_2)$ for all $s_1 \in \mathcal{N}_{G_n}(u) \setminus \rho(u, v)$ and $s_2 \in \mathcal{N}_{G_n}(v) \setminus \rho(u, v)$. This
requires $O(d_{\max}^2)$ computations. The computed values and $|T_{\rho(u,v)}(s_1, s_2)|$ are stored in a hash table. In the next step,
we repeat the same procedure for node pairs that are two hops apart, and so on until we have considered every pair
of nodes in $G_n$. Note that for a path $(u_1, \ldots, u_m)$ and $s_1, s_2$ neighbors of $u_1$ and $u_m$ respectively, $q(u_1, u_m; s_1, s_2)$
can be computed in constant time from (8), as $q(u_2, u_m; s_1, s_2) = q(u_2, u_m; u_1, s_2)$ and $q(u_1, u_{m-1}; s_1, s_2) =
q(u_1, u_{m-1}; s_1, u_m)$. A similar remark applies for the computation of $|T_{\rho(u_1, u_m)}(s_1, s_2)|$. In addition, each lookup
of the hash table takes $O(1)$ complexity since $G_n$ is known and collision-free hashing can be used. Therefore,
the overall complexity is $O(n^2 d_{\max}^2)$. The algorithm to compute the infection sources estimator is formally given in
Algorithm 2. We call this the Two Source Estimation (TSE) algorithm, and it forms the basis of our algorithm for
multiple sources estimation in the sequel.
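To make the two-source count concrete, the following Python sketch evaluates $C(s_1, s_2 \mid G_n)$ for a single pair of sources and compares it with a brute-force enumeration of infection sequences. It assumes the forms used in Algorithm 2 and Appendix B: $q(u_i, u_i; s_1, s_2) = |T_{u_i}|^{-1}$, $q(u_i, u_j; s_1, s_2) = [q(u_{i+1}, u_j; s_1, s_2) + q(u_i, u_{j-1}; s_1, s_2)] / |T_{\rho(u_i, u_j)}|$, and $C(s_1, s_2 \mid G_n) = (n-2)! \, q(u_1, u_m; s_1, s_2) \prod_{u \in G_n \setminus \rho(s_1, s_2)} |T_u|^{-1}$. All function names are hypothetical, the tree is an adjacency dictionary of sets, and no hash table or reuse across pairs is attempted, so this is a correctness sketch rather than the TSE algorithm itself.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def tree_path(adj, a, b):
    """Unique path from a to b in a tree given as {node: set_of_neighbours}."""
    parent = {a: None}
    stack = [a]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in parent:
                parent[y] = x
                stack.append(y)
    path = [b]
    while path[-1] != a:
        path.append(parent[path[-1]])
    return path[::-1]


def hanging_sizes(adj, path):
    """|T_u(s1, s2)| for every node: subtrees hang off the path and point away from it."""
    on_path = set(path)
    size = {}

    def visit(u, parent):
        size[u] = 1 + sum(visit(v, u) for v in adj[u]
                          if v != parent and v not in on_path)
        return size[u]

    for u in path:
        visit(u, None)
    return size


def count_two_sources(adj, s1, s2):
    """C(s1, s2 | G_n) via the recurrence for q quoted in the lead-in."""
    path = tree_path(adj, s1, s2)
    size = hanging_sizes(adj, path)
    inner = path[1:-1]                       # u_1, ..., u_m
    m = len(inner)
    q = {}
    for span in range(1, m + 1):             # sub-paths of increasing length
        for i in range(m - span + 1):
            j = i + span - 1
            block = sum(size[u] for u in inner[i:j + 1])   # |T_rho(u_i, u_j)|
            top = Fraction(1) if i == j else q[i + 1, j] + q[i, j - 1]
            q[i, j] = top / block
    result = Fraction(factorial(len(adj) - 2)) * (q[0, m - 1] if m else 1)
    for u in adj:
        if u not in path:
            result /= size[u]
    return result


def brute_force_count(adj, sources):
    """Count orderings in which every node has an infected neighbour when it is infected."""
    others = [u for u in adj if u not in sources]
    count = 0
    for seq in permutations(others):
        infected = set(sources)
        for u in seq:
            if not (adj[u] & infected):
                break
            infected.add(u)
        else:
            count += 1
    return count
```

The brute-force counter is only usable on very small trees, but it gives an independent check of the recurrence before a full implementation with stored intermediate values is attempted.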


D. Geometric Trees with Two Sources



In this section, we study the special case of geometric trees, propose an approximate estimator for geometric
trees, and provide theoretical analysis for its performance. First, we give the definition of geometric trees and prove
some of their key properties. Then, we derive a lower bound for $C(S \mid G_n)$, and propose an estimator based on this
lower bound. We show that our proposed estimator is asymptotically correct, i.e., it identifies the actual infection
sources with probability (conditioned on the infection sources) going to one as the infection graph $G_n$ becomes
large. For mathematical convenience, instead of letting the number of infected nodes $n$ grow large, we let the time
$t$ from the start of the infection process to our observation time become large.

Algorithm 2 Two Source Estimation (TSE)
1: Input: $G_n$
2: Let $(\hat{s}_1, \hat{s}_2)$ be the maximizer of $C(\cdot, \cdot \mid G_n)$. Set $C^* = 0$.
3: for $d = 1$ to diameter of $G_n$ do
4:     for each $s_1 \in G_n$ do
5:         for each $s_2$ such that $d(s_1, s_2) = d$ do
6:             Let $\rho(s_1, s_2) = (s_1, u_1, \ldots, u_{d-1}, s_2)$.
7:             if $d = 1$ then
8:                 $q(u_1, u_{d-1}; s_1, s_2) = 1$.
9:             else if $d = 2$ then
10:                Store $q(u_1, u_1; s_1, s_2) = |T_{u_1}(s_1, s_2)|^{-1}$ and $|T_{u_1}(s_1, s_2)|$.
11:            else
12:                Lookup $|T_{\rho(u_1, u_{d-2})}(s_1, u_{d-1})|$, $q(u_2, u_{d-1}; u_1, s_2)$, and $q(u_1, u_{d-2}; s_1, u_{d-1})$.
13:                Store $|T_{\rho(u_1, u_{d-1})}(s_1, s_2)|$.
14:                Store $q(u_1, u_{d-1}; s_1, s_2) = \dfrac{q(u_2, u_{d-1}; u_1, s_2) + q(u_1, u_{d-2}; s_1, u_{d-1})}{|T_{\rho(u_1, u_{d-1})}(s_1, s_2)|}$.
15:            end if
16:            Compute $g(s_1, s_2)$ from (10).
17:            $C(s_1, s_2 \mid G_n) = (n - 2)! \, q(u_1, u_{d-1}; s_1, s_2) / g(s_1, s_2)$.
18:            Update $(\hat{s}_1, \hat{s}_2)$ and $C^*$ if $C(s_1, s_2 \mid G_n) > C^*$.
19:        end for
20:    end for
21: end for

The geometric tree network is defined in [11] w.r.t. a single infection source. In the following, we extend this
definition to the case where there are two sources. Let $S^* = \{s_1, s_2\}$ be the infection sources, and let $\bar{T}_u(s_1, s_2)$
be defined in the graph $G$ in the same way as $T_u(s_1, s_2)$ is defined for $G_n$. Let $\mathcal{N}_G(\rho(s_1, s_2))$ be the set of nodes
that have a neighboring node in $\rho(s_1, s_2)$. For each node $u$, let $n(u, r)$ be the number of nodes in $\bar{T}_u(s_1, s_2)$ that
are at a distance $r$ from $u$. We say that $G$ is a geometric tree if for all $u \in \mathcal{N}_G(\rho(s_1, s_2))$, we have

$$b r^{\alpha} \le n(u, r) \le c r^{\alpha}, \qquad (11)$$

where $\alpha$, $b$, and $c$ are fixed positive constants with $b \le c$. The condition (11) implies that all trees defined w.r.t.
the infection sources are growing polynomially fast at about the same rate. As we have assumed that the infection
rates are homogeneous for every node, the resulting infection graph $G_n$ will also be approximately regular with
high probability. We have the following properties for a geometric tree, whose proofs are in Appendix C.

Fig. 4. Addition of virtual nodes $x_1$ and $x_2$.
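The growth condition (11) can be checked empirically on a finite tree with a short breadth-first search. The sketch below is only an illustration: the function names `shell_counts` and `satisfies_growth`, the adjacency-dictionary representation, and the `blocked` argument (used to keep the search inside the subtree that points away from the path between the sources) are assumptions of this example. On a finite graph the condition can of course only be checked up to the largest radius present.

```python
from collections import deque


def shell_counts(adj, u, blocked=()):
    """n(u, r): number of nodes at distance r = 1, 2, ... from u, never entering `blocked`."""
    dist = {u: 0}
    queue = deque([u])
    counts = []
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist and y not in blocked:
                dist[y] = dist[x] + 1
                if dist[y] > len(counts):    # first node seen at a new radius
                    counts.append(0)
                counts[dist[y] - 1] += 1
                queue.append(y)
    return counts


def satisfies_growth(counts, alpha, b, c):
    """Check b * r**alpha <= n(u, r) <= c * r**alpha for every observed radius r."""
    return all(b * r ** alpha <= n_r <= c * r ** alpha
               for r, n_r in enumerate(counts, start=1))
```

For a node $u$ adjacent to the path, one would pass its path neighbor in `blocked` so that only the subtree rooted at $u$ is explored.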

Lemma 3. Suppose that $G$ is a geometric tree with two infection sources $S^* = \{s_1, s_2\}$. Let $\alpha$, $b$ and $c$ be fixed
positive constants satisfying (11) for the geometric tree $G$. Let $t$ be the time from the start of the infection process
to our observation time. For any $\epsilon \in (0, 1)$, let $\mathcal{E}_t$ be the event that all nodes within distance $t(1 - t^{-1/2+\epsilon})$ of
either source node are infected, and no nodes at distance greater than $t(1 + t^{-1/2+\epsilon})$ from either source node are
infected. Then, there exists $t_0$ such that for all $t > t_0$, $P(\mathcal{E}_t) \ge 1 - \epsilon$. Furthermore, conditioned on $\mathcal{E}_t$, we have for
all $u \in \mathcal{N}_G(s_1) \cup \mathcal{N}_G(s_2)$ or $u \in \rho(s_1, s_2) \setminus S^*$,

$$N_{\min}(t) \le |T_u(s_1, s_2)| \le N_{\max}(t), \qquad (12)$$

where

$$N_{\min}(t) = \frac{b}{1 + \alpha} \left( t - t^{1/2+\epsilon} - d(s_1, s_2) - 2 \right)^{\alpha+1}, \qquad (13)$$

and

$$N_{\max}(t) = \frac{c}{1 + \alpha} \left( t + t^{1/2+\epsilon} \right)^{\alpha+1}. \qquad (14)$$

In addition, for $t > t_0$, we have

The infection sequences count $C(s_1, s_2 \mid G_n)$ is not amenable to analysis. In the following, we seek an approximation to
simplify our analysis. For $s_1, s_2 \in G_n$, suppose that $\rho(s_1, s_2) = (s_1, u_1, \ldots, u_m, s_2)$, with $p = |\rho(s_1, s_2)| = m + 2$.
Instead of computing $C(s_1, s_2 \mid G_n)$, we consider a new infection graph $G'_n$ with two "virtual" nodes $x_i$, $i = 1, 2$,
added, where $x_i$ is attached to $s_i$ (see Figure 4). We now consider the infection sequence count $C(x_1, x_2 \mid G'_n) \ge
C(s_1, s_2 \mid G_n)$. Since the trees rooted at $x_i$ are single node trees, we have

$$C(x_1, x_2 \mid G'_n) = C(s_1, x_2 \mid G'_n) + C(x_1, s_2 \mid G'_n) \le 2(n - 1) \, C(s_1, s_2 \mid G_n),$$

where the last inequality follows because if $s_1$ and $x_2$ are sources, then $s_2$ can be inserted in any of at most $n - 1$
positions in an infection sequence from $\Omega(G_n, \{s_1, s_2\})$, so that $C(s_1, x_2 \mid G'_n) \le (n - 1) C(s_1, s_2 \mid G_n)$. A similar
argument gives $C(x_1, s_2 \mid G'_n) \le (n - 1) C(s_1, s_2 \mid G_n)$.

Let $\zeta^* = (\zeta^*_1, \ldots, \zeta^*_p)$ be a permutation of the nodes in $\rho(s_1, s_2)$ such that $|T_{\zeta^*_i}(s_1, s_2)| \ge |T_{\zeta^*_j}(s_1, s_2)|$ for all
$1 \le i \le j \le p$, i.e., the nodes in $\zeta^*$ are arranged in descending order of the size of the sub-trees rooted at them.
Let $I^*_i(s_1, s_2) = I_i(\zeta^*; s_1, s_2)$ be the total number of nodes in the $i$ biggest sub-trees in
$\{T_u(s_1, s_2) : u \in \rho(s_1, s_2)\}$. From Lemma 2, we have

$$C(x_1, x_2 \mid G'_n) \ge n! \cdot 2^{p-1} \prod_{i=1}^{p} I^*_i(s_1, s_2)^{-1} \prod_{u \in G_n \setminus \rho(s_1, s_2)} |T_u(s_1, s_2)|^{-1}, \qquad (15)$$

where the inequality holds because $|\Gamma(s_1, s_2)| = 2^{p-1}$, and each term in the sum over reverse infection sequences is
lower bounded by $\prod_{i=1}^{p} I^*_i(s_1, s_2)^{-1}$. We use the lower bound in (15) as a proxy for $C(s_1, s_2 \mid G_n)$. However, we have
used a very loose lower bound in (15), so we propose the estimator

$$\hat{S} = \arg\max_{s_1, s_2 \in G_n} \tilde{C}(s_1, s_2 \mid G_n), \qquad (16)$$

where

$$\tilde{C}(s_1, s_2 \mid G_n) = n! \cdot Q(s_1, s_2) \prod_{u \in G_n \setminus \rho(s_1, s_2)} |T_u(s_1, s_2)|^{-1}, \qquad (17)$$

$$Q(s_1, s_2) = [2(1 + \delta)]^{p-1} \prod_{i=1}^{p} I^*_i(s_1, s_2)^{-1},$$

and $\delta$ is a fixed positive constant, to be chosen based on prior knowledge about the graph $G$. Algorithm 2 can be
modified to find the maximizer for $\tilde{C}(\cdot, \cdot \mid G_n)$. We call this the geometric tree TSE algorithm. The following result
provides a way to choose $\delta$, and shows that our proposed estimator $\hat{S}$ is asymptotically correct in a geometric tree.
A proof is provided in Appendix D.

Theorem 2. Suppose that $G$ is a geometric tree with two infection sources $S^* = \{s_1, s_2\}$. Let $d_{\min}$ and $d_{\max}$ be
constants such that $\deg_G(s_i) \in [d_{\min}, d_{\max}]$ for $i = 1, 2$. Let $b$ and $c$ be fixed positive constants satisfying (11) for
the geometric tree $G$. Suppose that

dmin - ^\/2'imax- (18) 

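Then, for any $\delta$ in the non-empty interval

$$\left( \frac{c \, d_{\max}}{b (d_{\min} - 1)} - 1, \; \frac{b (d_{\min} - 2)}{2c} \right), \qquad (19)$$

we have

$$\lim_{t \to \infty} P(\hat{S} = S^* \mid S^*) = 1.$$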

Theorem 2 implies that if we know the constants governing the regularity condition (11) for $G$, we can choose a $\delta$
so that our estimator $\hat{S}$ gives the true infection sources with high probability if the infection graph $G_n$ is large. The
class of geometric trees as defined by (11) can be used to model various scenarios in practice, e.g., a tree spanning
a wireless sensor network with nodes randomly scattered. However, the assumption (11) may also be overly strong
for other applications. In Section V, we perform numerical studies to gain insights into the performance of our
proposed estimator for different classes of tree networks.

E. Unknown Number of Infection Sources

In most practical applications, the number of infection sources is not known a priori. However, typically we may
be able to guess the maximum number of infection sources $k_{\max}$, or we can choose a reasonable value of $k_{\max}$
depending on the size of the infection graph $G_n$. In this section, we present a heuristic algorithm that allows us to
estimate the infection sources with a given $k_{\max}$.

We first consider the instructive case where $k_{\max} = 2$ and $G$ is a geometric tree. In this case, the number of
infection sources can be either one or two. Suppose we run the geometric tree TSE algorithm on $G_n$. We have the
following result, whose proof is in Appendix E.

Theorem 3. Suppose that there is a single infection source $s$ and $G$ is a geometric tree with (11) holding for all
nodes $u$ that are neighbors of $s$. Suppose that $s$ has degree $\deg_G(s) \in [d_{\min}, d_{\max}]$, where $d_{\min}$ and $d_{\max}$ are
positive constants satisfying (18). Then, for any $\delta$ in the interval (19), the geometric tree TSE algorithm estimates
as sources $s$ and one of its neighbors with probability (conditioned on $s$ being the infection source) going to 1 as
$t \to \infty$.

Theorem 3 implies that when there exists only one source, the geometric tree TSE algorithm finds two neighboring
nodes, one of which is the true source. From Theorem 2 and Assumption 2, if there are two sources, our algorithm
identifies the two source nodes, which are at least two hops from each other, with high probability. Therefore, by
checking the distance between the two nodes identified by the geometric tree TSE algorithm, we can estimate the
number of source nodes in the infection graph. This observation together with Theorem 1 suggests the following
heuristic.

(i) Randomly choose $k_{\max}$ nodes satisfying Assumption 2 as the infection sources and find a Voronoi partition
for $G_n$. Use the SSE algorithm to find a source node for each infection region. Repeat these steps until for
every region, the distance between estimated source nodes between iterations is below a fixed threshold or a
maximum number of iterations is reached. We call this the Infection Partitioning (IP) algorithm (see Algorithm 3).

(ii) For any two regions in the partition obtained from step (i) that are connected by an edge in $G_n$, run the TSE
algorithm in the combined region to find two source estimates. If the two estimates have distance less than
$\tau$, we decrement the number of source nodes, and repeat step (i).

(iii) The above two steps are repeated until no two regions in the Voronoi partition can be combined. The
formal algorithm is given as the Multiple Sources Estimation and Partitioning (MSEP) algorithm in Algorithm 4.
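The Voronoi partitioning used in step (i) can be sketched with a multi-source breadth-first search, as below. This is a minimal illustration assuming hop distance and an adjacency-dictionary graph; the function name `voronoi_partition` is hypothetical, and the tie-breaking rule (the center whose search front arrives first keeps the node) is an arbitrary choice made for the example rather than something the heuristic above prescribes.

```python
from collections import deque


def voronoi_partition(adj, centers):
    """Assign every node to its nearest center by hop distance (multi-source BFS).

    Ties go to whichever center reaches the node first in queue order, which is
    an arbitrary choice made for this sketch.
    """
    owner = {c: c for c in centers}
    queue = deque(centers)
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in owner:
                owner[y] = owner[x]      # y inherits the center of its discoverer
                queue.append(y)
    regions = {c: set() for c in centers}
    for node, c in owner.items():
        regions[c].add(node)
    return regions
```

An IP iteration would then run a single-source estimator inside each returned region and feed the new estimates back in as `centers`.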

To compute the complexity of the MSEP algorithm, we note that since the IP algorithm is based on the SSE
algorithm, it has complexity $O(n)$. For each value of $k = 1, \ldots, k_{\max}$ in the MSEP algorithm, there are $O(k^2)$
pairs of neighboring regions in the infection partition. For each pair of regions, the TSE algorithm makes $O(n^2)$
computations. Summing over all $k = 1, \ldots, k_{\max}$, the time complexity of the MSEP algorithm can be shown to be
$O(k_{\max}^3 n^2)$. On the other hand, to compute $C(S \mid G_n)$ for $|S| = k_{\max}$ would require $\Omega(n^{k_{\max}})$ computations.

Algorithm 3 Infection Partitioning (IP)
1: Inputs: An infection source set $S^{(0)} = \{s_i^{(0)} : i = 1, \ldots, m\}$ in $G_n$.
2: Iterations:
3: for $l = 1$ to MaxIter do
4:     Run the Voronoi partitioning algorithm with centers in $S^{(l-1)}$ to obtain the infection partition $A^{(l)} = \cup_{i=1}^{m} A_i^{(l)}$.
5:     for $i = 1$ to $m$ do
6:         Run SSE algorithm in $A_i^{(l)}$ to obtain $s_i^{(l)} = \arg\max_{s \in A_i^{(l)}} C(s \mid A_i^{(l)})$.
7:     end for
8:     $S^{(l)} := \{s_i^{(l)} : i = 1, \ldots, m\}$
9:     if $\max_{1 \le i \le m} d(s_i^{(l)}, s_i^{(l-1)}) < \eta$ for some fixed small positive $\eta$ then
10:        break
11:    end if
12: end for
13: return $(S^{(l)}, A^{(l)})$

IV. Identifying Infection Sources and Regions for General Graphs 

In this section, we generalize the MSEP algorithm to identify multiple infection sources in general graphs $G$.
In [14], the SSE algorithm is extended, using a heuristic, to general graphs when it is known that there is only a
single infection source in the network. The algorithm first chooses a node $s$ of $G_n$ as the root node, and generates
a spanning tree $T_{\mathrm{bfs}}(s, G_n)$ of $G_n$ rooted at $s$ using the breadth-first-search (BFS) procedure. The SSE algorithm
is then applied on this spanning tree to compute $C(s \mid T_{\mathrm{bfs}}(s, G_n))$. In addition, the infection sequences count is
weighted by the likelihood of the BFS tree. This is repeated using every node in $G_n$ as the root node, and the node
$\hat{s}$ with the maximum weighted infection sequences count is chosen as the source estimator, i.e.,

$$\hat{s} = \arg\max_{v \in G_n} P(\sigma_v \mid v) \, C(v \mid T_{\mathrm{bfs}}(v, G_n)),$$

where $\sigma_v$ is the sequence of nodes that corresponds to an infection spreading from $v$ along the BFS tree. It can
be shown that this algorithm has complexity $O(n^2)$. For further details, the reader is referred to [14]. We call this
algorithm the SSE-BFS algorithm in this paper.
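The structure of the BFS heuristic can be illustrated with the following Python sketch. It is deliberately simplified: the likelihood weight $P(\sigma_v \mid v)$ is omitted, so the function below maximizes only the unweighted count on each BFS tree and is not the SSE-BFS estimator itself. The count on a tree uses the single-source form $C(s \mid T) = (n-1)! \prod_{u \ne s} |T_u|^{-1}$ that appears in Appendix B; all function names are hypothetical, and logarithms are used to avoid large factorials.

```python
from collections import deque
from math import lgamma, log


def bfs_tree(adj, root):
    """Children lists of the breadth-first spanning tree rooted at `root`, plus BFS order."""
    children = {root: []}
    order = [root]
    queue = deque([root])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in children:
                children[y] = []
                children[x].append(y)
                order.append(y)
                queue.append(y)
    return children, order


def log_sequence_count(adj, root):
    """log C(root | T_bfs(root, G_n)) using C(s | T) = (n-1)! / prod_{u != s} |T_u|."""
    children, order = bfs_tree(adj, root)
    size = {}
    for u in reversed(order):                 # leaves first
        size[u] = 1 + sum(size[v] for v in children[u])
    n = len(order)
    return lgamma(n) - sum(log(size[u]) for u in order if u != root)


def unweighted_sse_bfs(adj):
    """Node maximizing the BFS-tree count; the likelihood weight P(sigma_v | v) is omitted."""
    return max(adj, key=lambda v: log_sequence_count(adj, v))
```

Building one BFS tree per candidate root costs $O(n)$ on a sparse graph, which is where the quadratic overall cost of the heuristic comes from.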

We adapt the MSEP algorithm for general graphs using the same BFS heuristic. Specifically, we replace the
SSE algorithm in line 6 of the IP algorithm with the SSE-BFS algorithm. In addition, in line 9 of the MSEP
algorithm, we run the TSE algorithm on $T_{\mathrm{bfs}}(s_i, A_i) \cup T_{\mathrm{bfs}}(s_j, A_j)$, where the two BFS trees are connected by
randomly selecting an edge $(u, v)$ in $G_n$ with $u \in T_{\mathrm{bfs}}(s_i, A_i)$ and $v \in T_{\mathrm{bfs}}(s_j, A_j)$. We call this modified algorithm
the MSEP-BFS algorithm.

Algorithm 4 Multiple Sources Estimation and Partitioning (MSEP)
1: Inputs: $G_n$ and $k_{\max}$
2: Initialization:
3: $k := k_{\max}$; choose $S := \{s_1, \ldots, s_k\}$ randomly in $G_n$.
4: Iterations:
5: while $k > 1$ do
6:     $(S, A)$ = Algorithm IP$(S)$
7:     $S' := S$
8:     for all regions $A_i$ and $A_j$ in the partition $A$ such that there exists an edge $(u, v)$ in $G_n$ with $u \in A_i$ and $v \in A_j$ do
9:         Set $(u, v)$ = Algorithm TSE$(A_i \cup A_j)$.
10:        if $d(u, v) < \tau$ then
11:            Merge $A_i$ and $A_j$, set $s_i = u$ and discard $s_j$
12:            $k := k - 1$
13:            break
14:        end if
15:    end for
16:    if $S = S'$ then
17:        break
18:    end if
19: end while
20: return $(S, A)$



Since the worst case complexity for the SSE-BFS algorithm is $O(n^2)$, the complexity of the MSEP-BFS algorithm
can be shown to be $O(k_{\max}^3 n^2)$, which is the same complexity as the MSEP algorithm. To verify the effectiveness
of the MSEP-BFS algorithm, we conduct simulations on both synthetic and real world networks in Section V.

V. Simulation Results and Tests 

In this section, we present results from simulations and tests on real data to verify our proposed algorithms. We
first consider geometric tree networks and regular tree networks with various numbers of infection sources, and then
we present results on small-world networks and a real world power grid network. We also apply our algorithms
to the contact tracing data obtained during the SARS outbreak in Singapore in 2003 [1] and the Arizona-Southern
California cascading power outages in 2011 [31].

Fig. 5. Estimating the number of infection source nodes.

TABLE II
Performance comparisons.

                              Average    Average error distance Δ                                  Average minimum
Simulation setting            diameter   MSEP/MSEP-BFS          nSSE                               infection region
network topology       |S*|   of G_n     η = 0   η = diameter   η = 0   η = diameter   known |S*|  covering percentage (%)
geometric trees        2      63.7       0.61    1.72           9.65    30.16          12.85       97.06
                       3      66.2       0.91    2.42           7.69    29.95          14.84       89.77
regular trees          2      40.5       0.84    6.07           4.50    17.70          6.13        73.82
                       3      43.7       0.94    6.24           3.39    17.47          6.59        65.95
small-world networks   2      35.5       2.95    8.19           5.40    17.13          8.28        76.62
                       3      40.9       2.58    8.18           4.99    18.56          10.37       60.69
power grid network     2      27.3       3.65    7.39           5.50    14.66          7.89        70.29
                       3      30.8       2.85    8.47           4.71    14.75          8.89        59.95

A. Synthetic Networks 

We perform simulations on geometric trees, regular trees, and small-world networks. The number of infection
sources $|S^*|$ is chosen to be 1, 2, or 3, and we set $k_{\max} = 3$. For each type of network and each number of
infection sources, we perform 1000 simulation runs with 500 infected nodes. We randomly choose infection sources
satisfying Assumption 2 and obtain the infection graph by simulating the infection spreading process using the SIR
model. Finally, the MSEP or MSEP-BFS algorithm, for tree networks and small-world networks respectively, is
applied to the infection graph to estimate the number and locations of the infection sources. The estimation results
for the number of infection sources $|S^*|$ in different scenarios are shown in Figure 5. It can be seen that our algorithm
correctly finds the number of infection sources more than 93% of the time for geometric trees, and more than 71%
of the time for regular trees. The accuracy of about 69.2% for small-world networks is worse than that for the
tree networks, as the infection tree for a small-world network has to be estimated using the BFS heuristic, which
introduces additional errors into the procedure.
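A simulation of this kind can be sketched as follows. The sketch is an SI-type spreading process only (no recovered state is modeled), in which the next infection travels along an edge chosen uniformly at random from the current set of edges joining infected and susceptible nodes, as in the proof of Theorem 1. The function names, the adjacency-dictionary graph, and the `seed` parameter are assumptions of this example, not details of the simulation setup used for the reported results.

```python
import random


def simulate_si(adj, sources, n_infected, seed=None):
    """Grow an infection from `sources` until `n_infected` nodes are infected.

    Each step picks one susceptible edge (infected endpoint, susceptible endpoint)
    uniformly at random. Returns (order, infector): the infection order of the
    non-source nodes and, for every infected node, the node that infected it.
    """
    rng = random.Random(seed)
    infected = set(sources)
    infector = {s: None for s in sources}
    order = []
    frontier = [(s, y) for s in sources for y in adj[s] if y not in infected]
    while len(infected) < n_infected and frontier:
        k = rng.randrange(len(frontier))
        frontier[k], frontier[-1] = frontier[-1], frontier[k]
        x, y = frontier.pop()
        if y in infected:
            continue                      # stale edge: y was infected through another edge
        infected.add(y)
        infector[y] = x
        order.append(y)
        frontier.extend((y, z) for z in adj[y] if z not in infected)
    return order, infector


def infection_regions(infector):
    """Map each source to the set of nodes it infected, by following infector links."""
    def root(u):
        while infector[u] is not None:
            u = infector[u]
        return u

    regions = {}
    for u in infector:
        regions.setdefault(root(u), set()).add(u)
    return regions
```

Discarding stale edges when they are drawn is a rejection step, so the accepted edge is still uniform over the valid susceptible edges. The regions returned by `infection_regions` give the ground-truth partition against which estimated infection regions can be compared.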

When there is more than one infection source, we compare the performance of the MSEP algorithm with a
naive estimator based on the SSE algorithm. We call this the nSSE algorithm. Specifically, in the estimator for tree
networks, we first compute $C(u \mid G_n)$ for all nodes $u \in G_n$, and choose the $|S^*|$ nodes with the largest counts as
the source nodes. In non-tree networks, we use the SSE-BFS algorithm. Since the nSSE algorithm cannot estimate
$|S^*|$, we consider two variants. In the first variant, we assume the nSSE algorithm has prior knowledge of $|S^*|$. In
the second variant, we guess $|S^*|$ by choosing uniformly from $\{1, \ldots, k_{\max}\}$.

To quantify the performance of each algorithm, we first match the estimated source nodes $\hat{S} = \{\hat{s}_i : i =
1, \ldots, |\hat{S}|\}$ with the actual sources $S^*$ so that the sum of the error distances between each estimated source and
its match is minimized. Let this matching be denoted by the function $\pi$, which matches each actual source $s_i$ to
$\hat{s}_{\pi(i)}$. If we have incorrectly estimated the number of infection sources, i.e., $|\hat{S}| \ne |S^*|$, we add a penalty term to
this sum. The average error distance is then given by

$$\Delta = \frac{1}{|S^*|} \left( \sum_{i=1}^{\min(|S^*|, |\hat{S}|)} d(\hat{s}_{\pi(i)}, s_i) + \eta \, \big| |\hat{S}| - |S^*| \big| \right),$$

where $\eta$ is a penalty weight for incorrectly estimating the number of infection sources. For different applications,
we may assign different values to $\eta$ depending on how important it is to estimate correctly the number of infection
sources. In our simulations, we consider the cases where $\eta = 0$, and where $\eta$ is the diameter of the infection graph.
The average error distances for the different types of networks are provided in Table II. Clearly, the MSEP/MSEP-BFS
algorithm outperforms the nSSE algorithm, even when the nSSE algorithm has prior knowledge of the number
of sources. When $|S^*|$ is known a priori, the performance of the nSSE algorithm deteriorates with increasing $|S^*|$.
This is to be expected as the SSE algorithm assumes that the node with the largest infection sequence count is the
only source, and this node tends to be close to the distance center [32] of the infection graph. The histograms of
the average error distances when $\eta = 0$ are shown in Figure 6.

The MSEP/MSEP-BFS algorithm also estimates the infection region of each source. To evaluate its accuracy, we
first perform the matching process described previously. Let the true infection region of $s_i$ be $A_{n,i}$ and the estimated
infection region of $\hat{s}_{\pi(i)}$ be $\hat{A}_{n,i}$, where we set $\hat{A}_{n,i} = \emptyset$ if we have underestimated the number of sources and $s_i$
is unmatched. We define the correct infection region covering percentage for $s_i$ as the ratio between $|A_{n,i} \cap \hat{A}_{n,i}|$
and $|A_{n,i}|$, and we compute the minimum (or worst case) infection region covering percentage as

$$\min_{i} \frac{|A_{n,i} \cap \hat{A}_{n,i}|}{|A_{n,i}|}.$$

This is then averaged over all simulation runs. We find that the average minimum infection region covering
percentage is more than 59% for all networks, as shown in Table II.
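The two evaluation metrics can be sketched in Python as follows. The matching is found by exhaustive search over permutations, which is adequate for the handful of sources considered here. Note the assumptions: the function names are hypothetical, `dist` is a symmetric hop-distance function supplied by the caller, and the normalization of the error distance by the number of true sources is our reading of the definition rather than a confirmed detail.

```python
from itertools import permutations


def average_error_distance(dist, true_sources, est_sources, eta=0.0):
    """Minimum-sum matching of estimates to true sources plus a count penalty.

    `dist(a, b)` must be symmetric. Dividing by the number of true sources is
    an assumption of this sketch.
    """
    small, large = sorted((list(true_sources), list(est_sources)), key=len)
    best = min(
        sum(dist(a, b) for a, b in zip(small, perm))
        for perm in permutations(large, len(small))
    )
    penalty = eta * abs(len(est_sources) - len(true_sources))
    return (best + penalty) / len(true_sources)


def min_covering_percentage(true_regions, est_regions):
    """Worst-case |A_i ∩ Â_i| / |A_i| over matched sources, in percent.

    Both arguments are lists of node sets in matched order; pass an empty set
    as the estimated region of a true source that was left unmatched.
    """
    return 100.0 * min(len(a & e) / len(a)
                       for a, e in zip(true_regions, est_regions))
```

For larger source sets, the exhaustive matching could be replaced by an assignment solver, but that is beyond what the small values of $k_{\max}$ used here require.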



B. Real World Networks 

We verify the performance of the MSEP-BFS algorithm on the western states power grid network of the United
States [33]. We simulate the infection spreading process on the power grid network, which contains 4941 nodes.
For each simulation run, 1, 2 or 3 infection sources are randomly chosen from the power grid network under
Assumption 2, and the spreading process is simulated so that a total of 500 nodes are infected. For each value of
$|S^*|$, 1000 simulation runs are performed. The simulation results are shown in Figures 5 and 6(d), and Table II.
We see that the MSEP-BFS algorithm outperforms the nSSE algorithm in every scenario. The average infection
region covering percentage is above 59%.

Fig. 6. Histogram of the average error distances for various networks: (a) geometric trees, (b) regular trees,
(c) small-world networks, (d) US power grid network. Each panel compares MSEP and nSSE for $|S^*| = 2$ and
$|S^*| = 3$. We assume $\eta = 0$ and that the nSSE algorithm has prior knowledge of the number of infection sources.

C. Tests on Real Data 

To gain some insight into the performance of the MSEP-BFS algorithm on real infection spreads, we
conduct two tests on real infection spread data. We first apply the MSEP-BFS algorithm to a network of nodes
that represent the individuals who were infected with the SARS virus during an epidemic in Singapore in the year
2003. The data was collected using contact tracing of patients [1], where an edge between two nodes indicates that
there is some form of interaction or relationship between the individuals (e.g., they are family members, classmates,
colleagues, or commuters who shared the same public transport system). A part of the SARS infection network
corresponding to a cluster of 193 patients is shown in Figure 7. We test the MSEP-BFS algorithm on the network
in Figure 7, assuming that there are at most $k_{\max} = 3$ infection sources. The MSEP-BFS algorithm
correctly estimates the number of infection sources to be one, and correctly identifies the real infection source.

Fig. 7. Illustration of a cluster of the SARS infection network with a single source.

We next consider the Arizona-Southern California cascading power outages in 2011 [31]. The affected power
network is represented by a graph where a node represents a key facility (substation or generating plant) affected by
an outage, and an edge between two nodes indicates that there is a transmission line between these two facilities. The
cascading outage started with the loss of a single transmission line. However, as indicated in [31], this transmission
line alone would not have caused a cascading outage. After the loss of this transmission line, instantaneous power flow
redistributions led to large voltage deviations, resulting in the nuclear units at San Onofre Nuclear Generating
Station being taken off the power grid. The failures of these two key facilities together serve as the main causes
of the subsequent cascading outages, so these two facilities are considered as the two infection sources. The main
affected power network containing 48 facilities is shown in Figure 8. We test the MSEP-BFS algorithm on the
network in Figure 8 and assume that there are at most $k_{\max} = 3$ infection sources. The MSEP-BFS
algorithm correctly estimates the number of infection sources to be two. It also identifies one of the sources correctly,
and its other estimate is one hop away from the real source.




Fig. 8. Illustration of the main affected power network with two infection sources. 

VI. Conclusion 

We have derived estimators for the infection sources and regions when the number of infection sources is bounded
but unknown a priori. The estimators are based only on knowledge of the infected nodes and their underlying
network connections. We provide an approximation for the infection source estimator for the class of geometric
trees when there are at most two sources in the network. We show that this estimator asymptotically correctly
identifies the infection sources when the number of infected nodes grows large. We also propose an algorithm that
estimates the number of source nodes, and identifies them and their respective infection regions for general infection
graphs. Simulation results on geometric trees, regular trees, small-world networks, the US power grid network, and
experimental results on the SARS infection network and cascading power outages show that our proposed estimation
procedure performs well in general, with an average error distance of less than 4. The estimation accuracy of the
number of source nodes is over 65% in all the networks we consider, with the geometric tree networks having an
accuracy of over 90%. Furthermore, the minimum infection region covering percentage is more than 59% for all
networks. Our estimation procedure assumes only knowledge of the underlying network connections. In practical
applications where more information about the infection process is available, a more accurate and intelligent guess
of the number of infection sources can be made.

In this paper, we have adopted a simple SI infection model with homogeneous spreading rates, allowing us
to derive analytical results that provide useful insights into infection source estimation for practical networks.
However, this simplistic diffusion model does not adequately capture the real world dynamics of many networks.
Future research includes the use of richer diffusion models that allow the inclusion of drifts and other dynamics
in the infection spreading process, and tools from statistics to approximate optimal estimators for the infection
sources. Our proposed algorithms find a set of nodes most likely to infect or influence a network, and are thus
potentially useful for various practical applications. For example, our algorithm may be integrated with non-model-based
consensus methods [34], [35] to design multi-agent control systems that use only a small subset of agents as
controllers. In cloud-centric media platforms [36], variants of our proposed algorithm may be used for intelligent
content cache management. These are all areas of future research.

Appendix A 
Proof of Theorem 1

Let nodes that are infected by source $s_i$ be colored with color $i$, with $i = 1, \ldots, k$. Then a partition $A_n$ corresponds
to a coloring of the graph $H_n$, and to quantify the probability of a partition, it is sufficient to consider only infection
sequences in the graph $H_n$. We have

$$P(A_n \mid S, G_n) = \sum_{\sigma \in \Omega(H_n, S, A_n)} P(\sigma \mid S), \qquad (20)$$

where

$$\Omega(H_n, S, A_n) = \{\sigma \in \Omega(H_n, S) : \sigma \cap A_{n,i} \text{ is an infection sequence, for all } i = 1, \ldots, k\},$$

and $\sigma \cap A_{n,i}$ is the subsequence of $\sigma$ containing only nodes that are in $A_{n,i}$.

Let $h = |H_n| - k$, and consider an infection sequence $\sigma = (\sigma_1, \ldots, \sigma_h) \in \Omega(H_n, S, A_n)$. Let the set of edges
connecting susceptible nodes to infected nodes be called the susceptible edge set. We have assumed that the infection
times of susceptible nodes are independent and identically exponentially distributed. Therefore, given the infection
sequence $\sigma_1, \ldots, \sigma_{l-1}$, the next edge along which the infection is spread is chosen uniformly at random from the
susceptible edge set at time index $l - 1$. Since $H_n$ is a tree where all nodes except those in $S$ have degree 2, after
infection of a new node, the susceptible edge set size remains the same except in the case where the infected node is
the last node to be infected amongst those on a path connecting two infection sources. In that case, the susceptible
edge set size reduces by 2. Let $\mathcal{L}$ be the set of indices of the last infected nodes on every path connecting infection
sources. Letting $n_l = 1$ if $l \notin \mathcal{L}$ and 2 otherwise, we then have

$$P(\sigma \mid S) = \prod_{l=1}^{h} n_l \, p_l(\sigma \mid H_n, S) = 2^{p} \prod_{l=1}^{h} p_l(\sigma \mid H_n, S), \qquad (21)$$

where $p$ is the number of paths connecting infection sources, and

$$p_l(\sigma \mid H_n, S) = \left( \sum_{s \in S} \deg_{H_n}(s) - 2 \sum_{j \in \mathcal{L}} \mathbf{1}\{j < l\} \right)^{-1}. \qquad (22)$$

Choose two sources $s_i$ and $s_j$ and let $m$ be the number of nodes in the path $\rho(s_i, s_j)$ connecting $s_i$ and $s_j$,
excluding the source nodes. Suppose that $r > \lceil m/2 \rceil$ nodes in this path have color $i$. Construct a new coloring $A'_n$
so that $\lceil m/2 \rceil$ nodes in $\rho(s_i, s_j)$ closest to $s_i$ have color $i$ and the rest have color $j$. The rest of the nodes in $A'_n$
have the same colors as those in $A_n$. Each infection sequence $\sigma \in \Omega(H_n, S, A_n)$ corresponds to an infection
sequence $\sigma' \in \Omega(H_n, S, A'_n)$, where the last $x = r - \lceil m/2 \rceil$ color-$i$ nodes in $\sigma$ become the last $x$ color-$j$
nodes in $\sigma'$. From (22), we have $p_l(\sigma \mid H_n, S) = p_l(\sigma' \mid H_n, S)$ for all $l$. Since $\binom{m}{\lceil m/2 \rceil} \ge \binom{m}{r}$, we have
$|\Omega(H_n, S, A'_n)| \ge |\Omega(H_n, S, A_n)|$, and therefore (20) yields $P(A'_n \mid S, G_n) \ge P(A_n \mid S, G_n)$.

The same argument can be repeated a finite number of times for all paths in $H_n$ connecting infection sources.
This shows that the estimator $\hat{A}_n(S)$ is a Voronoi partition of $G_n$, and the proof is complete.

Appendix B 
Proof of Lemma |2] 

To simplify notations, we write Tu{si,S2) as T^, with the implicit understanding that all trees are defined w.r.t. 
{si, S2}. The number of infection sequences can be found by counting the number of ways to form such a sequence. 
The n — 2 slots in a sequence are occupied by nodes from Ts.\{sj}, i = 1,2, and ). Therefore, we have 

2 



i=l 

= -^^^^ . R{ui,Um) ■ n l^^'l''' 
Kp(«i,«™)l- ,;GT,,,i=l,2 

where R{ui, Uj) for i < j is the number of ways of permuting the nodes in Tp(„.^^) such that the infection sequence 
property is maintained, and the last equaUty follows from Lemma [T] To simplify the notations, for 1 < i < j < m. 
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let 

J{ui,Uj)= JJ \Tv\~^- 

feT'p(„..„^)\p(ni,Mj) 

For example, from Lemma H] we have C{ui \ TuJ = {\Tu^ \ — l)\J{ui,Ui). In the following, we show that for 

1 < i < j < m, 

R{ui,Uj) = |Tp(„^^„^)|! • q{ui,Uj;si,S2) ■ J{ui,Uj). (23) 

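The proof proceeds by induction on $j - i$. If $j = i$, we have $R(u_i, u_i) = C(u_i \mid T_{u_i})$ and the claim follows from Lemma 1. Suppose that the claim (23) holds for all nodes $u_k$ and $u_p$ such that $p - k < j - i$. The number of permutations $R(u_i, u_j)$ can be computed by considering a sequence with $m = |T_{p(u_i,u_j)}|$ slots. The first slot can be filled with either $u_i$ or $u_j$. Therefore, we have

$$
\begin{aligned}
R(u_i, u_j) &= (m-1)! \left( \frac{C(u_i \mid T_{u_i}) \, R(u_{i+1}, u_j)}{(|T_{u_i}| - 1)! \, |T_{p(u_{i+1},u_j)}|!} + \frac{C(u_j \mid T_{u_j}) \, R(u_i, u_{j-1})}{(|T_{u_j}| - 1)! \, |T_{p(u_i,u_{j-1})}|!} \right) \\
&= (m-1)! \left( J(u_i, u_i) \, q(u_{i+1}, u_j; s_1, s_2) \, J(u_{i+1}, u_j) + J(u_j, u_j) \, q(u_i, u_{j-1}; s_1, s_2) \, J(u_i, u_{j-1}) \right) \\
&= (m-1)! \left( q(u_{i+1}, u_j; s_1, s_2) + q(u_i, u_{j-1}; s_1, s_2) \right) J(u_i, u_j),
\end{aligned}
$$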

where the penultimate equality follows from the inductive hypothesis and Lemma 1, and the last equality follows by noting that $J(u_i, u_i) J(u_{i+1}, u_j) = J(u_j, u_j) J(u_i, u_{j-1}) = J(u_i, u_j)$. The claim (23) now follows from (8).
Finally, (9) follows by an inductive argument using (8), which we omit. The proof is now complete.
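The first-slot argument used in the induction step can be checked numerically on a small example. The sketch below is an illustration under stated assumptions: the toy graph, the names `PATH`, `ADJ`, `brute_R` and `recursive_R` are hypothetical, and "the infection sequence property" is interpreted as follows: the two end nodes $u_i$ and $u_j$ may be infected at any time (they border already infected nodes outside the sub-path), while every other node must follow one of its neighbours. The recursion splits on which end node fills the first slot and interleaves the remaining nodes with a multinomial factor.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

# Hypothetical toy instance: path nodes u_1, u_2, u_3 are 0, 1, 2;
# node 3 hangs off node 0, and nodes 4 and 5 hang off node 1.
PATH = [0, 1, 2]
ADJ = {0: [1, 3], 1: [0, 2, 4, 5], 2: [1], 3: [0], 4: [1], 5: [1]}


def hanging_tree(k):
    """Nodes of T_{u_k}: u_k and everything reachable without crossing the path."""
    root = PATH[k]
    seen, stack = {root}, [root]
    while stack:
        u = stack.pop()
        for v in ADJ[u]:
            if v not in seen and v not in PATH:
                seen.add(v)
                stack.append(v)
    return seen


def size(a, b):
    """|T_{p(u_a, u_b)}|, the number of nodes in the trees hanging off u_a..u_b."""
    return sum(len(hanging_tree(k)) for k in range(a, b + 1))


def brute_R(i, j):
    """Count valid orderings of T_{p(u_i, u_j)} by exhaustive enumeration."""
    nodes = set()
    for k in range(i, j + 1):
        nodes |= hanging_tree(k)
    entry = {PATH[i], PATH[j]}  # end nodes can be infected from outside
    count = 0
    for perm in permutations(sorted(nodes)):
        infected = set()
        for v in perm:
            if v not in entry and not any(u in infected for u in ADJ[v]):
                break
            infected.add(v)
        else:
            count += 1
    return count


def recursive_R(i, j):
    """Evaluate R(u_i, u_j) by conditioning on the node in the first slot."""
    if i == j:
        return Fraction(brute_R(i, i))  # equals C(u_i | T_{u_i})
    m = size(i, j)
    left = recursive_R(i, i) * recursive_R(i + 1, j) / (
        factorial(size(i, i) - 1) * factorial(size(i + 1, j)))
    right = recursive_R(j, j) * recursive_R(i, j - 1) / (
        factorial(size(j, j) - 1) * factorial(size(i, j - 1)))
    return factorial(m - 1) * (left + right)
```

On this instance `recursive_R(0, 2)` and `brute_R(0, 2)` agree, which is consistent with the first equality of the recursion; the check does not involve $q$ and so says nothing about (8) itself.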

Appendix C 
Proof of Lemma 3

The proof follows easily from Theorems 5 and 6 of [14]. Consider the infection spreading along a path in $G_n$. Let $\Pi(t)$ be the counting process of the number of infected nodes in this path. The process $\Pi(t)$ consists of exponentially distributed arrivals with rate 1, and at most one arrival with rate 2 if the path is between the two infection sources. Let $\Pi_1(t)$ be a unit rate Poisson process corresponding to the rate 1 arrivals. Then $\Pi_1(t) \leq \Pi(t) \leq \Pi_1(t) + 1$. From Theorem 6 of [14], we have for any positive $\gamma < 0.2$,

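$$
\begin{aligned}
P\big(\Pi(t) \leq t(1 - \gamma)\big) &\leq P\big(\Pi_1(t) \leq t(1 - \gamma) - 1\big) \leq \exp\left(-\tfrac{1}{2} t \left(\gamma + \tfrac{1}{t}\right)^2\right), \\
P\big(\Pi(t) \geq t(1 + \gamma)\big) &\leq P\big(\Pi_1(t) \geq t(1 + \gamma)\big) \leq \exp\left(-\tfrac{1}{4} t \gamma^2\right).
\end{aligned}
$$

The rest of the proof is the same as that of Theorem 5 of [14], and the proof is complete.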
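The concentration bounds for the unit rate Poisson process can be illustrated numerically. The sketch below compares exact Poisson tail probabilities with Chernoff-type bounds of the form $\exp(-\tfrac{1}{2} t \gamma^2)$ for the lower tail and $\exp(-\tfrac{1}{4} t \gamma^2)$ for the upper tail. The constants $\tfrac{1}{2}$ and $\tfrac{1}{4}$ are an assumption here (they are standard for Poisson variables and should be checked against Theorem 6 of [14]), and all function names are hypothetical.

```python
import math


def poisson_pmf(k, t):
    """P(Pi_1(t) = k) for a unit rate Poisson process, computed in log space."""
    return math.exp(-t + k * math.log(t) - math.lgamma(k + 1))


def lower_tail(t, x):
    """P(Pi_1(t) <= x)."""
    return sum(poisson_pmf(k, t) for k in range(math.floor(x) + 1))


def upper_tail(t, x):
    """P(Pi_1(t) >= x)."""
    return 1.0 - sum(poisson_pmf(k, t) for k in range(math.ceil(x)))


def bounds_hold(t, gamma):
    """Check both assumed Chernoff-type bounds at time t and deviation gamma."""
    lower_ok = lower_tail(t, t * (1 - gamma)) <= math.exp(-0.5 * t * gamma ** 2)
    upper_ok = upper_tail(t, t * (1 + gamma)) <= math.exp(-0.25 * t * gamma ** 2)
    return lower_ok, upper_ok


# Evaluate on a small grid of times and deviations with gamma < 0.2.
results = {(t, g): bounds_hold(t, g) for t in (10, 50, 200) for g in (0.05, 0.1, 0.19)}
```

Both tail probabilities decay exponentially in $t$ for fixed $\gamma$, which is the property the lemma relies on.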

Appendix D 
Proof of Theorem 2

Fig. 9. Illustration of the network structure when $u_0 \neq v_0$. Not all nodes are shown.

We first show that under (18), the interval (19) is non-empty. The condition (18) implies, after some algebraic manipulations, that

$$
\frac{d_{\max} \, c}{b \, (d_{\min} - 1)} < \frac{b \, (d_{\min} - 2)}{2c}.
$$

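Therefore (19) is a non-empty interval. Fix $\delta$ in the interval. Then for all $\epsilon > 0$ sufficiently small, we have

$$
\frac{b \, (d_{\min} - 1)(1 + \delta)}{d_{\max} \, c} > \frac{1}{1 - \epsilon}, \qquad
\frac{b \, (d_{\min} - 2)}{2(1 + \delta) \, c} > \frac{1}{1 - \epsilon}.
$$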

Recall that $t$ is the time from the start of the infection spreading to our observation of $G_n$. From Lemma 3, for each $\epsilon$, there exists $t_0$ such that if $t > t_0$, we have

$$
\frac{(d_{\min} - 1)(1 + \delta) N_{\min}(t)}{d_{\max} N_{\max}(t)} > 1, \tag{24}
$$

$$
\frac{(d_{\min} - 2) N_{\min}(t)}{2(1 + \delta) N_{\max}(t)} > 1. \tag{25}
$$

We will make use of the two inequalities (24) and (25) extensively in the following proof steps. Let $\mathcal{E}_t$ be the event defined in Lemma 3. Then from Lemma 3 we have for $t > t_0$,

$$
P(\hat{S} = S^* \mid S^*) \geq P(\hat{S} = S^* \mid S^*, \mathcal{E}_t) \, P(\mathcal{E}_t \mid S^*) \geq (1 - \epsilon) \, P(\hat{S} = S^* \mid S^*, \mathcal{E}_t). \tag{26}
$$

In the following, we show that $P(\hat{S} = S \mid S, \mathcal{E}_t) = 1$ for $t > t_0$. The proof then follows from (26) as $\epsilon$ can be chosen arbitrarily small.
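To make the role of (24) and (25) concrete, the following sketch computes the range of $\delta$ for which both inequalities hold, given $d_{\min}$, $d_{\max}$ and the region sizes. It assumes (24) has the form $(d_{\min}-1)(1+\delta)N_{\min}(t) / (d_{\max}N_{\max}(t)) > 1$ and (25) the form $(d_{\min}-2)N_{\min}(t) / (2(1+\delta)N_{\max}(t)) > 1$; the function names and numerical values are hypothetical.

```python
def holds_24(d_min, d_max, delta, n_min, n_max):
    """Assumed form of (24)."""
    return (d_min - 1) * (1 + delta) * n_min / (d_max * n_max) > 1


def holds_25(d_min, delta, n_min, n_max):
    """Assumed form of (25)."""
    return (d_min - 2) * n_min / (2 * (1 + delta) * n_max) > 1


def delta_interval(d_min, d_max, n_min, n_max):
    """Open interval of delta satisfying both inequalities, or None if empty."""
    ratio = n_min / n_max
    lower = d_max / ((d_min - 1) * ratio) - 1  # from (24): 1 + delta must exceed this
    upper = (d_min - 2) * ratio / 2 - 1        # from (25): 1 + delta must stay below this
    return (lower, upper) if lower < upper else None


# Hypothetical values: a 6-regular tree with N_min(t) / N_max(t) = 0.9.
interval = delta_interval(6, 6, 90, 100)
```

With these values the interval is roughly $(0.33, 0.8)$. When the degrees are small relative to $N_{\max}(t)/N_{\min}(t)$, for example $d_{\min} = 3$, no admissible $\delta$ exists.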

To show that $P(\hat{S} = S \mid S, \mathcal{E}_t) = 1$ is equivalent to showing that with probability one, $C(S \mid G_n) > C(u_m, v_l \mid G_n)$ for all node pairs $u_m, v_l \in G_n$ such that at least one of them is not in $S$. Let $u_0$ and $v_0$ be the first nodes in $p(s_1, s_2)$ that are connected to $u_m$ and $v_l$ respectively. We divide the proof into two cases, depending on whether $u_0$ and $v_0$ are distinct or not, as shown in Figures 9 and 10.

Suppose that $u_0 \neq v_0$. A typical network for this case is shown in Figure 9, where $m$, $l$, $n$, $p$, and $k$ are non-negative integers, and at least one of $u_m$ and $v_l$ is not in $S$, i.e., either $m + l > 0$ or $n + p > 0$. We let $u_0 = s_1$ if $n = 0$, and $v_0 = s_2$ if $p = 0$.

We will show that if either $m + l > 0$ or $n + p > 0$, we have for $t > t_0$,

$$
\frac{C(s_1, s_2 \mid G_n)}{C(u_m, v_l \mid G_n)} = \frac{C(s_1, s_2 \mid G_n)}{C(u_0, v_0 \mid G_n)} \cdot \frac{C(u_0, v_0 \mid G_n)}{C(u_m, v_l \mid G_n)} > 1. \tag{27}
$$
The proof follows by showing that $C(u_0, v_0 \mid G_n) \geq C(u_m, v_l \mid G_n)$, where strict inequality holds if $m + l > 0$, and $C(s_1, s_2 \mid G_n) \geq C(u_0, v_0 \mid G_n)$, with strict inequality holding if $n + p > 0$. From (11), we have



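$$
\begin{aligned}
\frac{C(u_0, v_0 \mid G_n)}{C(u_m, v_l \mid G_n)}
&= \frac{Q(u_0, v_0)}{Q(u_m, v_l)} \prod_{w \in p(u_m, u_1) \cup p(v_1, v_l)} |T_w(u_0, v_0)|^{-1} \\
&= [2(1 + \delta)]^{-(m+l)} \cdot \frac{\prod_{i=1}^{m+l+k+2} I_i^*(u_m, v_l)}{\prod_{i=1}^{k+2} I_i^*(u_0, v_0)} \cdot \prod_{w \in p(u_m, u_1) \cup p(v_1, v_l)} |T_w(u_0, v_0)|^{-1} \\
&\geq [2(1 + \delta)]^{-(m+l)} \cdot \prod_{j=1}^{m+l} I_j^*(u_m, v_l) \cdot \prod_{w \in p(u_m, u_1) \cup p(v_1, v_l)} |T_w(u_0, v_0)|^{-1} \\
&\geq \left( \frac{(d_{\min} - 2) N_{\min}(t) + 1}{2(1 + \delta) \cdot \max\{ |T_{u_1}(u_0, v_0)|, |T_{v_1}(u_0, v_0)| \}} \right)^{m+l} \\
&\geq \left( \frac{(d_{\min} - 2) N_{\min}(t) + 1}{2(1 + \delta) \cdot N_{\max}(t)} \right)^{m+l} \\
&> 1
\end{aligned}
$$

if $m + l > 0$. The first inequality follows because $I_{m+l+i}^*(u_m, v_l) \geq I_i^*(u_0, v_0)$ for $i = 1, \ldots, k + 2$, and the last inequality follows from (25) when $t > t_0$.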

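Let $\psi = \deg_G(s_1) + \deg_G(s_2)$. We have for $t > t_0$,

$$
\begin{aligned}
\frac{C(s_1, s_2 \mid G_n)}{C(u_0, v_0 \mid G_n)}
&= \frac{Q(s_1, s_2)}{Q(u_0, v_0)} \prod_{w \in p(s_1, x_1) \cup p(y_1, s_2)} |T_w(u_0, v_0)| \\
&= [2(1 + \delta)]^{n+p} \cdot \frac{\prod_{i=1}^{k+2} I_i^*(u_0, v_0)}{\prod_{i=1}^{n+p+k+2} I_i^*(s_1, s_2)} \cdot \prod_{w \in p(s_1, x_1) \cup p(y_1, s_2)} |T_w(u_0, v_0)| \\
&\geq [2(1 + \delta)]^{n+p} \cdot \prod_{j=k+3}^{n+p+k+2} I_j^*(s_1, s_2)^{-1} \cdot \prod_{w \in p(s_1, x_1) \cup p(y_1, s_2)} |T_w(u_0, v_0)| \\
&\geq \left( \frac{2(1 + \delta) \cdot \min\{ |T_{s_1}(u_0, v_0)|, |T_{s_2}(u_0, v_0)| \}}{\psi N_{\max}(t) + 2} \right)^{n+p} \\
&\geq \left( \frac{(1 + \delta)(d_{\min} - 1) \cdot N_{\min}(t) + 1 + \delta}{d_{\max} N_{\max}(t) + 1} \right)^{n+p} \\
&> 1,
\end{aligned}
$$

where the first inequality follows because $I_i^*(u_0, v_0) \geq I_i^*(s_1, s_2)$ for $i = 1, \ldots, k + 2$, and the last inequality follows from (24) if $n + p > 0$. The bound (27) is now proved.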

We next consider the case where $u_0 = v_0 = w_0$ in Figure 10, where $k$, $m$ and $l$ are non-negative integers. When $t > t_0$, we have the following bounds, which are straightforward to verify and whose proofs are omitted here.

(i) $I_i^*(u_m, v_l) \geq (\psi - 2) N_{\min}(t) + 2 \geq (d_{\min} - 2) N_{\min}(t)$ for $i = 1, \ldots, d(u_m, v_l) + 1$,

(ii) $I_i^*(s_1, s_2) \leq \psi N_{\max}(t) + 2 \leq 2 d_{\max} N_{\max}(t) + 2$ for all $i = 1, \ldots, d(s_1, s_2) + 1$,

(iii) $|T_{w_i}(u_m, v_l)| \geq (\psi' - 2) N_{\min}(t) + 2 \geq (d_{\min} - 2) N_{\min}(t)$ for all $i = 1, \ldots, k - 1$,

(iv) $|T_w(u_m, v_l)| \geq (d_{\min} - 1) N_{\min}(t) + 1$ for all $w \in p(s_1, s_2)$,

(v) $|T_{w_i}(s_1, s_2)| \leq N_{\max}(t)$ for all $i = 1, \ldots, k - 1$, and

(vi) $|T_w(s_1, s_2)| \leq N_{\max}(t)$ for all $w \in p(u_m, v_l)$.

We define products over empty sets to be 1.

Fig. 10. Illustration of the case where $u_0 = v_0 = w_0$.

Fig. 11. A typical network for a single source tree.
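The above bounds yield

$$
\frac{C(s_1, s_2 \mid G_n)}{C(u_m, v_l \mid G_n)}
= \frac{Q(s_1, s_2)}{Q(u_m, v_l)} \cdot \frac{\prod_{w \in G_n \setminus p(u_m, v_l)} |T_w(u_m, v_l)|}{\prod_{w \in G_n \setminus p(s_1, s_2)} |T_w(s_1, s_2)|}
$$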
-{Z{L -\- 0)) »=1 



n£''^^^'/*(^i,^2) nti IT.. (51, 52)1 n 



\Tw(si,S2) 



> 



k-l 

n 

i=l 
(rf, 



\TwAsi,S2) 



(2(1 + 6)) 



-d{u,„,vi)- 



na{u, 
1=1 



d(u^,vi)+l 



It{Um,Vi) 



2)N^^(t) 



iVmax(i) 



k-l 



(dr. 



Ww(^p{u^,vi) \Tw(si,S2)\ 

2)iVmi„(t)l \{l + 5){(d 



(2(1 + 5)) 



d{si,SQ,)+ 



lT{w<^p{si,S2) \^w(Um-,Vl)\ 



_ 2(l + <5)iV^ax(t) . 



\t^:r^^'it(sus2) 

l)iVmin(t) + l)^'^'^''^^+' 



4 



max-^max 



(t) + l 



>1, 



where the last inequality follows from (24) and (25). The theorem is now proved.



Appendix E 
Proof of Theorem 3

Let $t$ be the elapsed time from the start of an infection spreading from a single source $s$ to the time we observe $G_n$. We wish to show that Algorithm TSE estimates as sources $s$ and one of its neighbors with probability (conditioned on $s$ being the infection source) converging to 1 as $t \to \infty$. This is equivalent to showing that for $t$ sufficiently large, and for each pair of nodes $u_m, v_l \in G_n$ where either $d(u_m, s) > 1$ or $d(v_l, s) > 1$, there exists a neighbor $r$ of $s$ such that $C(s, r \mid G_n) > C(u_m, v_l \mid G_n)$.

A typical network is shown in Figure 11, where $m$ and $l$ are non-negative integers. If $m$, $l$ and $k$ are positive, we let $r$ be the neighbor of $s$ that lies on the path connecting $s$ to $u_m$ (i.e., the node $w_1$ in Figure 11). If $m$ and $l$ are positive and $k = 0$, then $r$ is chosen to be either $u_1$ or $v_1$. If $m = 0$, we must have $k > 0$ so that $w_k = u_m$ and $r = w_1$. A similar remark applies for the case $l = 0$. Note that $m + l > 0$. For $t$ sufficiently large, we have

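$$
\frac{C(s, r \mid G_n)}{C(u_m, v_l \mid G_n)}
= \frac{Q(s, r)}{Q(u_m, v_l)} \cdot \frac{\prod_{w \in G_n \setminus p(u_m, v_l)} |T_w(u_m, v_l)|}{\prod_{w \in G_n \setminus \{s, r\}} |T_w(s, r)|}
$$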



l-(m+0 . Ul^l^^ IiiUm,Vl) _ Ilwep{s,w,^,) \Tw{Um,Vl)\ 



= f2(l + <5)l 

m=ii:is,r) UtZ2\T.,is,r)\- n |r.(s,r)| 

m+l TT'^~1 \^ { \\ 

= [2(1 + • n iK^m, vi) ■ ^^-=1 1^"'-^^"-'^'^! 

> \2(1 + . IT fu i;/)r+' \Ts{um,vi)\'' ^ 

^[Z{L-\-0)\ \lw,,{Um,Vl)\ (f\k-2 . M (f\m+l+l 

\ ro^i , ;:m*: [ ('^min - 1)^^^111111 (i) 1 '"^^''''^ ^ 
^[2(^ + ^)1 ■i2(l + 5).iV_(t). 

> 1, 

where the last inequality follows from (25) and Lemma 3. The proof of the theorem is now complete.